OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

Kaichen Zhang, Keming Wu, Zuhao Yang, Bo Li, Kairui Hu, Bin Wang, Ziwei Liu, Xingxuan Li, Lidong Bing

TL;DR

OpenMMReasoner presents a transparent, two-stage open recipe for multimodal reasoning that combines supervised fine-tuning (SFT) and reinforcement learning (RL). It builds an 874K-sample SFT dataset and a 74K-sample RL dataset, and pairs them with systematic data distillation, cross-domain mixing, and a comparison of RL algorithms (GSPO, DAPO, GRPO). The resulting model improves on the Qwen2.5-VL-7B-Instruct baseline by 11.6% across nine multimodal reasoning benchmarks. The work identifies data quality, diversity, and training design as key drivers of performance, and releases its data pipelines, weights, and evaluation protocols as open source. In practice, this makes multimodal reasoning results reproducible and improves cross-domain transfer, including gains in textual reasoning observed during RL even though training uses only multimodal data.

Abstract

Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We have open-sourced all our code, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.

Paper Structure

This paper contains 41 sections, 5 equations, 7 figures, 17 tables.

Figures (7)

  • Figure 1: Rollout Analysis over RL. As RL training progresses, the ratio of reflection words in the model's responses increases.
  • Figure 2: Data Pipelines of OpenMMReasoner. We propose two training recipes covering both the SFT and RL phases. The pipeline begins by collecting diverse data sources and selecting teacher models to generate new answer traces. During the RL phase, we explore different algorithm choices and filtering strategies, leading to our final optimized recipe.
  • Figure 3: Data Source Distribution of OpenMMReasoner. Our dataset comprises diverse sources across multiple domains, aiming to balance data diversity and efficiency for optimal performance.
  • Figure 4: Overall results across different algorithms. We conduct a systematic comparison of various algorithms under identical multimodal RL training settings. GSPO demonstrates the highest training stability, exploration capability, and overall efficiency.
  • Figure 5: Training dynamics on the validation set during RL. During RL training, we observe that textual reasoning ability improves alongside visual reasoning, even when trained solely on multimodal data, indicating strong cross-domain generalization of reasoning capabilities.
  • ...and 2 more figures
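
The reflection word ratio tracked in Figure 1 can be illustrated with a minimal sketch. The paper's exact definition and vocabulary are not given in this summary, so the word list `REFLECTION_WORDS`, the regex tokenization, and the function name `reflection_word_ratio` are all assumptions made for illustration only.

```python
import re

# Hypothetical marker list: the paper's actual reflection vocabulary
# is not stated in this summary.
REFLECTION_WORDS = {"wait", "however", "recheck", "verify",
                    "alternatively", "reconsider"}


def reflection_word_ratio(response: str) -> float:
    """Fraction of alphabetic word tokens that are reflection markers."""
    # Assumed tokenization: lowercase runs of letters/apostrophes; digits are ignored.
    tokens = re.findall(r"[a-z']+", response.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in REFLECTION_WORDS)
    return hits / len(tokens)


# Two toy rollouts: one terse, one that pauses to double-check itself.
early = "The answer is 12."
late = "The answer is 12. Wait, let me verify: 3 times 4 is 12."
print(reflection_word_ratio(early), reflection_word_ratio(late))
```

Averaging such a ratio over the rollouts sampled at each training step would give a curve comparable in spirit to the one the caption describes, though the numbers depend entirely on the chosen word list.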