MMC: Iterative Refinement of VLM Reasoning via MCTS-based Multimodal Critique

Shuhang Liu, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Jun Du, Qing Wang, Jianshu Zhang, Quan Liu, Jianqing Gao, Feng Ma

TL;DR

This work tackles reasoning reliability in visual language models (VLMs) by adding an external critic to the reasoning loop. In the proposed actor-critic framework, the actor produces step-by-step reasoning and the critic returns targeted feedback, which the actor uses to revise its reasoning iteratively. The critique training data is built automatically with Monte Carlo Tree Search (MCTS), yielding the MMC dataset with step-level supervision and little manual annotation. Experiments on several public benchmarks and mainstream VLMs show significant gains on complex multimodal reasoning tasks, which suggests the approach applies across models.
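The iterative refinement described above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the `Critique` type, the `refine` function, the callable signatures, and the `max_iters` cap are all assumptions made for the example, and the toy actor and critic stand in for real VLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Critique:
    """Hypothetical critic output: a verdict plus corrective feedback."""
    satisfactory: bool
    feedback: str = ""


# Hypothetical signatures: (image, question, previous_reasoning, feedback) -> reasoning
Actor = Callable[[str, str, Optional[str], Optional[str]], str]
# (image, question, reasoning) -> Critique
Critic = Callable[[str, str, str], Critique]


def refine(image: str, question: str, actor: Actor, critic: Critic,
           max_iters: int = 3) -> Tuple[str, List[str]]:
    """Revise the actor's reasoning until the critic accepts it.

    max_iters is an assumed safety cap; the source only says the loop
    runs until the critic deems the outcome satisfactory.
    """
    reasoning = actor(image, question, None, None)
    history = [reasoning]
    for _ in range(max_iters):
        critique = critic(image, question, reasoning)
        if critique.satisfactory:
            break
        # Feed the critic's feedback back to the actor for another attempt.
        reasoning = actor(image, question, reasoning, critique.feedback)
        history.append(reasoning)
    return reasoning, history


# Toy stand-ins for the two models (purely illustrative).
def toy_actor(image, question, previous, feedback):
    if feedback is None:
        return "Step 1: the tallest bar reads 3. Answer: 3"
    return "Step 1: the tallest bar reads 5. Answer: 5"


def toy_critic(image, question, reasoning):
    if reasoning.endswith("Answer: 5"):
        return Critique(satisfactory=True)
    return Critique(satisfactory=False,
                    feedback="Re-read the tallest bar; it is not 3.")


final, history = refine("chart.png", "How tall is the tallest bar?",
                        toy_actor, toy_critic)
```

In this toy run the critic rejects the first attempt and accepts the second, so `history` holds two reasoning paths and `final` is the revised one.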

Abstract

Visual language models (VLMs) have demonstrated strong performance across diverse multimodal reasoning tasks but still face challenges such as hallucinations, resulting in incorrect reasoning outcomes. Inspired by recent research on external feedback mechanisms in large language models (LLMs), we propose a multimodal actor-critic framework to enhance VLM reasoning capabilities. Specifically, the actor model generates step-by-step reasoning paths based on image and text inputs, while the critic model evaluates these reasoning paths and provides corrective feedback. The actor model iteratively refines its reasoning based on the feedback until the reasoning outcome is deemed satisfactory by the critic model. To reduce reliance on costly manual annotations, we introduce an automated method for constructing multimodal critique datasets. By leveraging Monte Carlo Tree Search (MCTS), we systematically guide the actor model to explore diverse reasoning paths. To obtain critique data for correcting erroneous reasoning steps, we prompt an annotator model to compare pairs of reasoning paths diverging from a shared ancestor node: one leading to a correct conclusion and the other to an incorrect one. This approach enables us to construct the MMC (MCTS-based Multimodal Critique) dataset, upon which we further develop a comprehensive training and inference pipeline. Extensive experiments conducted on several public benchmark datasets and mainstream VLMs demonstrate that our approach significantly improves the performance of VLMs on complex multimodal reasoning tasks, underscoring its effectiveness and wide applicability.
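The pairing step in the abstract (two reasoning paths that diverge from a shared ancestor node, one correct and one incorrect) can be illustrated on an already-built search tree. The sketch below is an assumption-laden example: `Node`, `leaf_paths`, and `contrastive_pairs` are hypothetical names, the tree is hand-written rather than produced by MCTS, and the paper's actual selection rule for pairs may differ. The search itself (selection, expansion, rollout, backpropagation) is not shown.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    """One reasoning step in a (hypothetical) search tree."""
    step: str
    children: List["Node"] = field(default_factory=list)
    is_correct: Optional[bool] = None  # set only on leaves (final answers)


def leaf_paths(node: Node, prefix: Tuple[str, ...] = ()
               ) -> Iterator[Tuple[List[str], Optional[bool]]]:
    """Yield (steps from `node` down to a leaf, leaf correctness)."""
    path = prefix + (node.step,)
    if not node.children:
        yield list(path), node.is_correct
    for child in node.children:
        yield from leaf_paths(child, path)


def contrastive_pairs(root: Node):
    """Collect (shared_prefix, correct_suffix, incorrect_suffix) triples.

    A pair is emitted whenever two different children of the same node
    lead to a correct leaf and an incorrect leaf respectively, so the
    two suffixes diverge right after the shared prefix.
    """
    pairs = []

    def visit(node: Node, prefix: List[str]) -> None:
        shared = prefix + [node.step]
        # Completed paths grouped by the child branch they pass through.
        branches = [list(leaf_paths(child)) for child in node.children]
        for i, good_branch in enumerate(branches):
            for j, bad_branch in enumerate(branches):
                if i == j:
                    continue
                for good, good_ok in good_branch:
                    if good_ok is not True:
                        continue
                    for bad, bad_ok in bad_branch:
                        if bad_ok is False:
                            pairs.append((shared, good, bad))
        for child in node.children:
            visit(child, shared)

    visit(root, [])
    return pairs


# Hand-written toy tree: two branches diverge right after the question.
tree = Node("<question>", [
    Node("Read the tallest bar as 5", [Node("Answer: 5", is_correct=True)]),
    Node("Read the tallest bar as 3", [Node("Answer: 3", is_correct=False)]),
])
pairs = contrastive_pairs(tree)
```

Each resulting triple is the kind of input an annotator model could be prompted with to explain where the incorrect path goes wrong; here the toy tree yields a single pair whose shared prefix is the question itself.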

Paper Structure

This paper contains 17 sections, 8 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: The construction pipeline of the MMC dataset.
  • Figure 2: Prompt templates used for the actor and critic during inference.
  • Figure 3: Case studies on evaluation samples from MathVista. Our critic model is capable of identifying both visual perception errors (left) and reasoning errors (right), and provides corrective feedback to guide the actor model in refining its reasoning and arriving at the correct answer.
  • Figure 4: An example of iterative refinement. The actor model produces an initially incorrect answer and iteratively corrects its reasoning through the critic's feedback, ultimately converging to the correct solution after two iterations.
  • Figure 5: Prompt template used for the annotator to generate critiques.
  • ...and 1 more figure