MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification

Linzhuang Sun, Hao Liang, Jingxuan Wei, Bihui Yu, Tianpeng Li, Fan Yang, Zenan Zhou, Wentao Zhang

TL;DR

The paper tackles the challenge of robust multimodal mathematical reasoning by introducing MM-Verifier and MM-Reasoner, accompanied by two-stage data synthesis to produce long MMCOT data and verification data. It combines a simulation-based long-CoT data-generation pipeline with rejection sampling, then distills long-CoT reasoning from text-only models to build a scalable MM-Reasoner. The MM-Verifier achieves state-of-the-art performance on MathCheck and strong results on MathVista and MathVerse, while the MM-Reasoner shows scalable improvements with more data; together they surpass GPT-4o on MathVista. These results demonstrate the effectiveness of verification-driven multimodal reasoning and offer a practical path toward data-efficient, high-performing multimodal math solvers.

Abstract

Under test-time scaling, integrating external slow thinking with a verification mechanism has been shown to enhance multi-round reasoning in large language models (LLMs). However, the multimodal (MM) domain still lacks a strong MM-Verifier. In this paper, we introduce MM-Verifier and MM-Reasoner to enhance multimodal reasoning through longer inference and more robust verification. First, we propose a two-step MM verification data synthesis method, which combines a simulation-based tree search with verification and uses rejection sampling to generate high-quality Chain-of-Thought (CoT) data. This data is then used to fine-tune the verification model, MM-Verifier. Additionally, we present a more efficient method for synthesizing MMCOT data, bridging the gap between text-based and multimodal reasoning. The synthesized data is used to fine-tune MM-Reasoner. Our MM-Verifier outperforms all larger models on the MathCheck, MathVista, and MathVerse benchmarks. Moreover, MM-Reasoner demonstrates strong effectiveness and scalability, with performance improving as data size increases. Finally, combining MM-Reasoner and MM-Verifier reaches an accuracy of 65.3 on MathVista with 12 rollouts, surpassing GPT-4o (63.8).
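To make the reasoner-plus-verifier combination concrete, the following is a minimal sketch of verifier-guided answer selection over several rollouts, contrasted with plain majority voting. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy verifier, and the rule of voting only among verifier-accepted answers are all hypothetical, since this section does not specify how verifier judgments are aggregated.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer (ties go to the earliest-seen answer)."""
    return Counter(answers).most_common(1)[0][0]


def verifier_select(answers: Sequence[str], verify: Callable[[str], bool]) -> str:
    """Vote only among answers the verifier accepts.

    Assumption: the verifier returns a binary correct/incorrect judgment per
    rollout. If it rejects every rollout, fall back to plain majority voting.
    """
    accepted = [a for a in answers if verify(a)]
    return majority_vote(accepted if accepted else answers)


# Toy data: six hypothetical rollouts for one question.
rollouts = ["15", "12", "15", "12", "15", "9"]


def toy_verify(answer: str) -> bool:
    """Hypothetical stand-in for a trained verifier; accepts only '12'."""
    return answer == "12"


plain = majority_vote(rollouts)                 # picks the most common answer
guided = verifier_select(rollouts, toy_verify)  # picks among accepted answers
```

In this toy case majority voting returns the frequent but unverified answer, while the verifier-guided rule returns the answer the verifier accepts. With a real verifier the benefit depends on how accurate its judgments are.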

Paper Structure

This paper contains 31 sections, 2 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Our 7B MM-Verifier outperforms all other models, including large models such as GPT-4o, Gemini, and Claude, on the MathCheck Outcome-Judging benchmark.
  • Figure 2: The pipeline for synthesizing MM-Verifier data. In Stage 1, a simulation-based algorithm produces long-chain CoT reasoning and long verification data. In Stage 2, the Verifier trained in Stage 1 is further enhanced through rejection sampling, which generates more long-CoT verification data.
  • Figure 3: Answer length under direct sampling and simulation-based search. Simulation-based search synthesizes longer CoT answers.
  • Figure 4: The performance of MM-Reasoner scales when paired with different MM-Verifiers. Across MM-Reasoner scales, MM-Verifier consistently outperforms both majority voting and the Stage 1 MM-Verifier.
  • Figure 5: A case study of MM-Verifier. MM-Verifier correctly verifies the answer using long CoT, while Qwen2-VL-72B-Instruct fails to.
  • ...and 4 more figures
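
The Stage 2 step described in the Figure 2 caption can be read as a filter over sampled verifications. The sketch below is one plausible reading, not the authors' pipeline: it assumes each candidate answer carries a ground-truth correctness label and that a sampled verification is kept only when its verdict matches that label. The record fields, the `sampler` interface, and the toy sampler are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A sampled verification: (verification chain-of-thought text, verdict).
Sample = Tuple[str, bool]


def rejection_sample(
    items: Sequence[Dict],
    sampler: Callable[[Dict, int], List[Sample]],
    k: int,
) -> List[Dict]:
    """Keep only sampled verifications whose verdict matches the known label.

    Assumption: each item has 'question', 'answer', and a ground-truth
    'is_correct' flag; `sampler` draws k verifications from a Stage 1 verifier.
    """
    kept = []
    for item in items:
        for cot, verdict in sampler(item, k):
            if verdict == item["is_correct"]:  # reject mismatched verdicts
                kept.append(
                    {
                        "question": item["question"],
                        "answer": item["answer"],
                        "verification": cot,
                        "verdict": verdict,
                    }
                )
    return kept


def toy_sampler(item: Dict, k: int) -> List[Sample]:
    """Hypothetical verifier stand-in: verdicts alternate True, False, ..."""
    return [(f"check {i} of answer {item['answer']}", i % 2 == 0) for i in range(k)]


items = [
    {"question": "2 + 2 = ?", "answer": "4", "is_correct": True},
    {"question": "2 + 2 = ?", "answer": "5", "is_correct": False},
]
data = rejection_sample(items, toy_sampler, k=4)
```

Each retained record pairs a candidate answer with a verification whose conclusion agrees with the label, which is the kind of example a verifier could then be fine-tuned on.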