Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, Weiran Huang

TL;DR

This work tackles multimodal reasoning in large language models and questions whether reflective 'aha moment' signals genuinely indicate advanced reasoning. It proposes a two-stage approach: first cold-start supervised fine-tuning with structured Chain-of-Thought data, then reinforcement learning via GRPO to refine capabilities. Extensive experiments on four visual-math benchmarks show that the SFT+RL pipeline yields state-of-the-art open-source performance at both 3B and 7B scales, with notable gains over SFT-only and RL-only baselines. The results offer practical guidance on cold-start data design, CoT strategies, and the role of data quality in scalable multimodal reasoning.

Abstract

Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While "aha moment" patterns, where models exhibit self-correction through reflection, are often attributed to emergent properties from RL, we first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but may not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3% → 73.4% on MathVista, 62.9% → 70.4% on We-Math) and our 3B model achieving performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at https://github.com/waltonfuture/RL-with-Cold-Start.
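To make the RL stage concrete, the following is a minimal sketch of the group-relative advantage at the core of GRPO: several responses are sampled for one prompt, and each response's reward is normalized against the mean and standard deviation of its own group. It is not taken from the authors' repository. The function name, the `eps` constant, the use of the population standard deviation, and the binary reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses sampled for the same prompt.

    Assumption: population standard deviation plus a small eps for stability;
    actual implementations may differ in both choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for four sampled answers (1.0 = correct, 0.0 = incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Under this sketch, responses that score above their group's mean receive a positive advantage and the rest a negative one. A group in which every response earns the same reward yields all-zero advantages and therefore contributes no learning signal.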

Paper Structure

This paper contains 32 sections, 8 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Performance comparison between our models and other advanced models on different multimodal reasoning benchmarks at both the 3B and 7B scales.
  • Figure 2: The frequency and accuracy of models' responses with and without an "aha moment". The results show that the presence of an "aha moment" does not necessarily correlate with higher accuracy.
  • Figure 3: Method overview. Our approach consists of two stages: (1) a cold start phase using supervised fine-tuning with Chain-of-Thought data, and (2) a reinforcement learning phase using GRPO to further enhance reasoning capabilities.
  • Figure 4: Comparison of model performance when trained on data with "aha moment" patterns (Reflection-CoT v2) versus randomly selected 32B-distilled data. Models trained on randomly selected data consistently outperform those trained on "aha moment" data, suggesting that these reflective patterns do not necessarily correlate with advanced reasoning capabilities.
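
The kind of analysis summarized in Figure 2 can be sketched as follows: split responses by whether they contain a reflective marker, then report each group's frequency and accuracy. This is only an illustration under stated assumptions. The marker phrases, the `Response` type, and the sample responses are hypothetical, and the paper's actual detection criteria may differ.

```python
from dataclasses import dataclass

# Hypothetical marker phrases; a real analysis would need a vetted list.
AHA_MARKERS = ("wait", "let me re-check", "on second thought")


@dataclass
class Response:
    text: str
    correct: bool


def has_aha(text):
    """Crude substring heuristic for reflective 'aha moment' phrasing."""
    lowered = text.lower()
    return any(marker in lowered for marker in AHA_MARKERS)


def aha_stats(responses):
    """Return frequency and accuracy for responses with and without markers."""
    groups = {True: [], False: []}
    for response in responses:
        groups[has_aha(response.text)].append(response.correct)
    total = len(responses)
    stats = {}
    for flag, outcomes in groups.items():
        label = "aha" if flag else "no_aha"
        stats[label] = {
            "frequency": len(outcomes) / total if total else 0.0,
            # Accuracy is undefined (None) for an empty group.
            "accuracy": sum(outcomes) / len(outcomes) if outcomes else None,
        }
    return stats


# Made-up responses, for illustration only.
sample = [
    Response("Wait, the angle is 30 degrees.", True),
    Response("The answer is 5.", True),
    Response("On second thought, x = 2.", False),
    Response("So y = 7.", False),
]
stats = aha_stats(sample)
```

Comparing `stats["aha"]["accuracy"]` with `stats["no_aha"]["accuracy"]` on real model outputs is the comparison the figure reports. Substring matching is a deliberately simple choice here and will miss or over-count some reflective responses.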