InfiR : Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning

Congkai Xie, Shuo Cai, Wenjun Wang, Pengxiang Li, Zhijie Sang, Kejing Yang, Yiming Zhang, Zhen Li, Guanghao Zhu, Zeyu Liu, Yang Yu, Yuhang Liu, Su Lu, Baoyi He, Qi Zhou, Xiaotian Han, Jianbo Yuan, Shengyu Zhang, Fei Wu, Hongxia Yang

TL;DR

InfiR demonstrates that small language models (SLMs) and multimodal SLMs (MSLMs) can achieve competitive reasoning with substantially lower compute requirements and fewer privacy concerns than large language models. The authors implement a data-centric training pipeline comprising high-quality pretraining data, an annealing phase, and carefully engineered supervised fine-tuning. This pipeline yields edge-deployable models such as InfiR-1B-Base, InfiR-1B-Instruct, and InfiR-VL-1.6B, which outperform stronger baselines on reasoning benchmarks and Android-world tasks. A dedicated multimodal pipeline aligns vision and language on a compact backbone, using curriculum learning and long chain-of-thought (long-CoT) data to deliver strong general and domain-specific reasoning. Together, these contributions advance practical, efficient AI systems whose reasoning capabilities suit on-device deployment and privacy-conscious applications.

Abstract

Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have made significant advancements in reasoning capabilities. However, they still face challenges such as high computational demands and privacy concerns. This paper focuses on developing efficient Small Language Models (SLMs) and Multimodal Small Language Models (MSLMs) that retain competitive reasoning abilities. We introduce a novel training pipeline that enhances reasoning capabilities and facilitates deployment on edge devices, achieving state-of-the-art performance while minimizing development costs. InfiR aims to advance AI systems by improving reasoning, reducing adoption barriers, and addressing privacy concerns through smaller model sizes. Resources are available at https://github.com/Reallm-Labs/InfiR.

Paper Structure

This paper contains 41 sections, 6 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: The pretraining data processing pipeline: heuristic filtering, reasoning-oriented text recall, deduplication, quality assessment, and decontamination. Comparative experiments on LLaMA3.2-1B with differently cleaned datasets validate the significance of data quality.
  • Figure 2: Supervised fine-tuning data synthesis pipeline. The pipeline initiates with a set of high-quality seed data, which is augmented through instruction evolution. Response candidates are generated using the Qwen-2.5-32B-Instruct model, followed by rejection sampling with a reward model and sandbox environment. Finally, we score the curated data for quality and difficulty, and assign domain labels.
  • Figure 3: Illustration of the MSLM training pipeline and the MSLM training details, showcasing the progression from captioning and QA tasks to text rendering, followed by instruction-tuning, culminating in enhanced mathematical and operating system reasoning abilities.
  • Figure 4: Left: distribution of programming languages. Right: similarity histogram of 2500 image-text pairs sampled from the coco-caption dataset.
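
Three of the stages named in the Figure 1 caption (heuristic filtering, deduplication, decontamination) can be illustrated with a small Python sketch. This is not the authors' implementation. The function names, the word-count and symbol-ratio thresholds, the use of exact hashing for deduplication, and the 5-gram overlap rule for decontamination are all assumptions chosen for illustration. The reasoning-oriented text recall and quality assessment stages are omitted here.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def passes_heuristics(text: str, min_words: int = 5, max_symbol_ratio: float = 0.3) -> bool:
    """Hypothetical heuristic filter: drop very short or symbol-heavy documents."""
    if len(text.split()) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def ngrams(text: str, n: int) -> set:
    """Return the set of word n-grams of the normalized text."""
    tokens = normalize(text).split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def clean_corpus(docs, benchmark_texts, n: int = 5):
    """Apply heuristic filtering, exact deduplication, then n-gram decontamination."""
    benchmark_ngrams = set()
    for item in benchmark_texts:
        benchmark_ngrams |= ngrams(item, n)

    seen_hashes = set()
    kept = []
    for doc in docs:
        # Stage 1 (assumed rules): heuristic filtering.
        if not passes_heuristics(doc):
            continue
        # Stage 2 (assumed exact-match variant): deduplication by content hash.
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        # Stage 3 (assumed rule): drop documents sharing any n-gram with a benchmark.
        if ngrams(doc, n) & benchmark_ngrams:
            continue
        kept.append(doc)
    return kept
```

Calling `clean_corpus(docs, benchmark_texts)` returns the surviving documents in their original order. A production pipeline would more likely use near-duplicate detection and model-based quality scoring; exact hashing is used here only to keep the sketch self-contained.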