InfiR: Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
Congkai Xie, Shuo Cai, Wenjun Wang, Pengxiang Li, Zhijie Sang, Kejing Yang, Yiming Zhang, Zhen Li, Guanghao Zhu, Zeyu Liu, Yang Yu, Yuhang Liu, Su Lu, Baoyi He, Qi Zhou, Xiaotian Han, Jianbo Yuan, Shengyu Zhang, Fei Wu, Hongxia Yang
TL;DR
InfiR demonstrates that small language models (SLMs) and multimodal SLMs (MSLMs) can achieve competitive reasoning with substantially lower compute cost and fewer privacy concerns than large language models (LLMs). The authors implement a data-centric training pipeline comprising high-quality pretraining data, an annealing phase, and carefully engineered supervised fine-tuning, enabling edge-deployable models such as InfiR-1B-Base, InfiR-1B-Instruct, and InfiR-VL-1.6B to outperform stronger baselines on reasoning benchmarks and Android-world tasks. A dedicated multimodal pipeline further aligns vision and language on a compact backbone, delivering strong general and domain-specific reasoning via curriculum learning and long chain-of-thought (CoT) data. Together, these contributions advance practical, efficient AI systems with robust reasoning capabilities suitable for on-device deployment and privacy-conscious applications.
Abstract
Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have made significant advancements in reasoning capabilities. However, they still face challenges such as high computational demands and privacy concerns. This paper focuses on developing efficient Small Language Models (SLMs) and Multimodal Small Language Models (MSLMs) that retain competitive reasoning abilities. We introduce a novel training pipeline that enhances reasoning capabilities and facilitates deployment on edge devices, achieving state-of-the-art performance while minimizing development costs. InfiR aims to advance AI systems by improving reasoning, reducing adoption barriers, and addressing privacy concerns through smaller model sizes. Resources are available at https://github.com/Reallm-Labs/InfiR.
