Guided Verifier: Collaborative Multimodal Reasoning via Dynamic Process Supervision

Lingzhuang Sun, Ruitong Liu, Yuxia Zhu, Xiaohan Xu, Jingxuan Wei, Xiangxiang Zhang, Bihui Yu, Wentao Zhang

TL;DR

Guided Verifier reframes multimodal reasoning as a collaborative, closed-loop process by pairing a policy with a dynamic verifier that guides the rollout in real time. A dedicated CoRe data synthesis pipeline provides process-level negatives and Correct-guide Reasoning trajectories to train the verifier, which is then used in Guided-GRPO RL with a composite reward to train the policy. Across MathVista, MathVerse, and MMMU, an 8B-parameter model with guided verification achieves strong results, rivaling larger open-source models and proprietary systems while improving training stability and inference efficiency. The work demonstrates substantial mitigation of error propagation in multimodal RL and highlights data-centric supervision as a critical driver of performance gains.

Abstract

Reinforcement Learning (RL) has emerged as a pivotal mechanism for enhancing the complex reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevailing paradigms typically rely on solitary rollout strategies in which the model works alone. This lack of intermediate oversight renders the reasoning process susceptible to error propagation, where early logical deviations cascade into irreversible failures, resulting in noisy optimization signals. In this paper, we propose the Guided Verifier framework to address these structural limitations. Moving beyond passive terminal rewards, we introduce a dynamic verifier that actively co-solves tasks alongside the policy. During the rollout phase, this verifier interacts with the policy model in real time, detecting inconsistencies and providing directional signals to steer the model toward valid trajectories. To facilitate this, we develop a specialized data synthesis pipeline targeting multimodal hallucinations, constructing the CoRe dataset of process-level negatives and Correct-guide Reasoning trajectories to train the guided verifier. Extensive experiments on MathVista, MathVerse, and MMMU indicate that by allocating compute to collaborative inference and dynamic verification, an 8B-parameter model can achieve strong performance.

Paper Structure (45 sections, 1 theorem, 21 equations, 19 figures, 6 tables)

Key Result

Theorem 3.1

Consider a policy rollout over $T$ steps with an average intrinsic error probability $\epsilon \in (0,1)$ per step. Let $\delta \in (0,1)$ denote the conditional failure probability of the verifier in detecting an error given that one has occurred. Under the assumption of negligible false rejections

Figures (19)

  • Figure 1: Conceptual Comparison of Reasoning Paradigms.
  • Figure 2: Overview of the Guided Verifier Framework. The proposed pipeline consists of three stages: (1) CoRe Dataset Synthesis. (2) Verifier SFT. (3) Guided-GRPO Algorithm.
  • Figure 3: Inference Efficiency Analysis: Interaction Turns.
  • Figure 4: Inference Efficiency Analysis: Token Consumption.
  • Figure 5: Training Dynamics and Stability Analysis.
  • ...and 14 more figures

Theorems & Definitions (1)

  • Theorem 3.1: Exponential Suppression of Error Propagation
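
The statement of Theorem 3.1 is cut off in this listing, so the following does not reproduce its bound. It is only a minimal sketch of how the quantities it defines ($T$, $\epsilon$, $\delta$) could interact, under assumptions that are not taken from the paper: steps fail independently, an error the verifier detects is corrected before the next step, and the verifier never rejects a correct step. The function name `clean_rollout_probability` is hypothetical.

```python
def clean_rollout_probability(eps: float, steps: int, delta: float = 1.0) -> float:
    """Probability that no uncorrected error survives a rollout.

    Illustrative assumptions (not from the paper):
      * each step errs independently with probability `eps`;
      * the verifier misses an existing error with probability `delta`;
      * a detected error is fixed immediately, and correct steps are
        never rejected.
    Setting delta=1.0 means the verifier never catches anything, which
    corresponds to a solitary rollout without guidance.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")

    # An error only survives a step if it occurs AND the verifier misses it.
    surviving_error = eps * delta
    return (1.0 - surviving_error) ** steps


if __name__ == "__main__":
    eps, delta = 0.1, 0.1  # arbitrary illustrative values
    for steps in (5, 10, 20):
        solo = clean_rollout_probability(eps, steps)
        guided = clean_rollout_probability(eps, steps, delta)
        print(f"T={steps:2d}  solo={solo:.4f}  guided={guided:.4f}")
```

Under these assumptions the per-step surviving error rate drops from $\epsilon$ to $\epsilon\delta$, so the chance of an error-free trajectory decays far more slowly with $T$ when a verifier is present. This is consistent with the theorem's title but should not be read as its exact statement.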