Grounding the Score: Explicit Visual Premise Verification for Reliable Vision-Language Process Reward Models

Junxin Wang, Dai Guan, Weijie Qiu, Zhihang Li, Yongbo Gai, Zhengyi Yang, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

Abstract

Vision-language process reward models (VL-PRMs) are increasingly used to score intermediate reasoning steps and rerank candidates under test-time scaling. However, they often function as black-box judges: a low step score may reflect a genuine reasoning mistake or simply the verifier's misperception of the image. This entanglement between perception and reasoning leads to systematic false positives (rewarding hallucinated visual premises) and false negatives (penalizing correct grounded statements), undermining both reranking and error localization. We introduce Explicit Visual Premise Verification (EVPV), a lightweight verification interface that conditions step scoring on the reliability of the visual premises a step depends on. The policy is prompted to produce a step-wise visual checklist that makes required visual facts explicit, while a constraint extractor independently derives structured visual constraints from the input image. EVPV matches checklist claims against these constraints to compute a scalar visual reliability signal, and calibrates PRM step rewards via reliability gating: rewards for visually dependent steps are attenuated when reliability is low and preserved when reliability is high. This decouples perceptual uncertainty from logical evaluation without per-step tool calls. Experiments on VisualProcessBench and six multimodal reasoning benchmarks show that EVPV improves step-level verification and consistently boosts Best-of-N reranking accuracy over strong baselines. Furthermore, injecting controlled corruption into the extracted constraints produces monotonic performance degradation, providing causal evidence that the gains arise from constraint fidelity and explicit premise verification rather than incidental prompt effects. Code is available at: https://github.com/Qwen-Applications/EVPV-PRM
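The reliability gating described in the abstract can be illustrated with a minimal sketch. The abstract does not give the aggregation rule or the gate function, so the mean aggregation and the multiplicative gate below are assumptions, and the names `visual_reliability` and `gate_rewards` are hypothetical and not taken from the released code.

```python
from typing import List, Sequence


def visual_reliability(support_scores: Sequence[float]) -> float:
    """Aggregate per-claim support scores in [0, 1] into one scalar r.

    Assumption: a plain mean. With no visual claims there is nothing
    to doubt, so r defaults to 1.0.
    """
    if not support_scores:
        return 1.0
    return sum(support_scores) / len(support_scores)


def gate_rewards(
    base_rewards: Sequence[float],
    is_visual: Sequence[bool],
    r: float,
) -> List[float]:
    """Attenuate rewards of visually dependent steps by r.

    Assumption: a multiplicative gate. Non-visual steps keep their
    base reward unchanged.
    """
    if len(base_rewards) != len(is_visual):
        raise ValueError("base_rewards and is_visual must have equal length")
    return [
        reward * r if visual else reward
        for reward, visual in zip(base_rewards, is_visual)
    ]
```

With one supported and one unsupported checklist claim, `visual_reliability([1.0, 0.0])` gives `r = 0.5`, so a visually dependent step with base reward 0.6 is scaled down while a non-visual step is left unchanged.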

Paper Structure

This paper contains 46 sections, 22 equations, 4 figures, and 10 tables.

Figures (4)

  • Figure 1: EVPV: premise-aware process reward modeling for reliable multimodal reasoning. (A) Motivating failure case. A standard VL-PRM (VisualPRM) can reward a locally fluent step that relies on a hallucinated visual premise (e.g., "a cylindrical hole"). EVPV prompts the policy to state an explicit visual checklist, verifies it against independently extracted structured visual constraints, and gates step rewards when the premise is unreliable. (B) Where step errors come from. On VisualProcessBench, most step errors stem from visual misinterpretation (left); these errors are dominated by structural misunderstandings and value misreadings (right), motivating explicit premise verification. (C) Step-level verification. EVPV-PRM achieves higher overall Macro-F1 on VisualProcessBench than prior multimodal PRMs. (D) Deployable test-time gains. Under Best-of-$8$ reranking for InternVL2.5 policies, EVPV-PRM yields consistent BoN@8 improvements, $\Delta_8=\mathrm{BoN@8}-\mathrm{Pass@1}$, across model scales, indicating more reliable selection of grounded solutions under test-time scaling.
  • Figure 2: Overview of EVPV-PRM. Given an image $I$ and question $q$, the policy model generates a step-by-step solution and, for each step, declares whether it depends on visual evidence, forming a visual checklist of explicit claims. In parallel, a constraint extractor predicts a structured set of visual facts $C$ (numeric readings, geometric relations, and compositional structure). We compute a visual reliability score $r$ by matching checklist claims against $C$ to obtain support scores and aggregating them into a single confidence signal. A step verifier then produces base step rewards, which are calibrated by reliability gating: rewards for non-visual steps are kept unchanged, while rewards for visually dependent steps are down-weighted when $r$ is low and preserved when $r$ is high. The resulting reliability-gated step rewards are aggregated for Best-of-$N$ reranking and process diagnosis.
  • Figure 3: Training pipeline for the constraint extractor and step verifier. We train the constraint extractor $E_\phi$ by distilling gold structured constraints $C^\star$ from a strong teacher on image--question inputs (here, 20K samples from VisualPRM400K with qwen3-vl-235b-a22b-instruct), using supervised fine-tuning with $\mathcal{L}_{\mathrm{con}}=-\log P_\phi(C^\star\mid I,q)$. After SFT initialization, we construct preference pairs by letting $E_\phi$ generate candidate constraints and selecting hard cases where the teacher identifies large deviations from $C^\star$; we then apply DPO to improve constraint fidelity. In parallel, we train the step verifier $V_\theta$ with step-level correctness labels via binary cross-entropy. Gold constraints are used only during training; inference relies solely on predicted constraints and checklist consistency.
  • Figure 4: Constraint quality--performance causal curves under controlled noise.
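
The captions for Figures 1 and 2 mention aggregating reliability-gated step rewards for Best-of-$N$ reranking and reporting $\Delta_8=\mathrm{BoN@8}-\mathrm{Pass@1}$. The sketch below shows one way that selection step could look. The captions do not state how step rewards are aggregated into a solution score, so the mean used here is an assumption, and the names `solution_score`, `best_of_n`, and `delta_n` are hypothetical.

```python
from typing import Sequence


def solution_score(gated_step_rewards: Sequence[float]) -> float:
    """Collapse gated step rewards into one solution-level score.

    Assumption: a plain mean over steps; an empty solution scores 0.0.
    """
    if not gated_step_rewards:
        return 0.0
    return sum(gated_step_rewards) / len(gated_step_rewards)


def best_of_n(candidates: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate with the highest score.

    Each candidate is the list of gated step rewards for one sampled
    solution. Ties resolve to the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [solution_score(steps) for steps in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def delta_n(bon_accuracy: float, pass_at_1: float) -> float:
    """Gain of Best-of-N reranking over single-sample accuracy."""
    return bon_accuracy - pass_at_1
```

For example, given three candidates with gated step rewards `[0.9, 0.2]`, `[0.7, 0.8]`, and `[0.5, 0.5]`, `best_of_n` selects index 1, since its mean reward is the highest of the three.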