Empowering Lightweight MLLMs with Reasoning via Long CoT SFT
Linyu Ou, YuYang Yin
TL;DR
This work tackles reasoning in lightweight multimodal language models (MLLMs). It shows that a supervised fine-tuning (SFT) stage on long chain-of-thought (long CoT) data substantially improves reasoning, especially when followed by a reinforcement learning (RL) phase. Using a ZPD-informed dataset with graded difficulty and a modified GRPO (Group Relative Policy Optimization) framework, the authors demonstrate an SFT-before-RL synergy that yields state-of-the-art results on multiple math-centric benchmarks at both 3B and 7B scales. Key findings are that long CoT data matters more than the modality of the training data, that data difficulty drives efficient learning, and that the two-stage pipeline is robust. The work suggests a practical path for deploying lightweight MLLMs with strong reasoning abilities and provides curated data and code to support future research.
Abstract
While Reinforcement Learning with Verifiable Rewards has enhanced the reasoning of large-scale language models (LLMs), its efficacy for lightweight multimodal language models (MLLMs) with fewer than seven billion parameters remains underexplored. This paper investigates the role of long Chain-of-Thought (long CoT) data in enhancing the reasoning abilities of such MLLMs. Our findings demonstrate that Supervised Fine-Tuning (SFT) with long CoT data significantly improves MLLM reasoning. Furthermore, we observe that after this initial SFT phase, MLLMs can achieve additional performance gains through a subsequent RL stage. We conclude that an SFT stage with long CoT data is a critical prerequisite for developing the reasoning capabilities of lightweight MLLMs.
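
For readers unfamiliar with the RL stage, the following is a minimal sketch of the group-relative advantage used in standard GRPO with a verifiable reward. It is not the authors' modified framework, whose details are not given in this summary. The function names, the exact-match reward, the use of the population standard deviation, and the `eps` constant are all illustrative assumptions.

```python
from statistics import mean, pstdev


def verifiable_reward(predicted: str, reference: str) -> float:
    """Hypothetical rule-based reward: 1.0 on exact match after trimming, else 0.0."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of responses sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). The eps term
    (an assumption) avoids division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one math question whose reference answer is "42".
sampled_answers = ["42", "41", " 42 ", "7"]
rewards = [verifiable_reward(a, "42") for a in sampled_answers]
advantages = group_relative_advantages(rewards)
```

Correct answers receive positive advantages and incorrect ones negative advantages, so the policy update favors responses that beat the group average. When every response in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal, which is one way to see why the difficulty of the training data matters for efficient learning.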
