MJ1: Multimodal Judgment via Grounded Verification

Bhavesh Kumar, Dylan Feng, Leonard Tang

Abstract

Multimodal judges struggle to ground decisions in visual evidence. We present MJ1, a multimodal judge trained with reinforcement learning that enforces visual grounding through a structured grounded verification chain (observations $\rightarrow$ claims $\rightarrow$ verification $\rightarrow$ evaluation $\rightarrow$ scoring) and a counterfactual consistency reward that penalizes position bias. Even without training, our mechanism improves base-model accuracy on MMRB2 by +3.8 points on Image Editing and +1.7 on Multimodal Reasoning. After training, MJ1, with only 3B active parameters, achieves 77.0% accuracy on MMRB2 and surpasses orders-of-magnitude larger models like Gemini-3-Pro. These results show that grounded verification and consistency-based training substantially improve multimodal judgment without increasing model scale.
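The chain order named in the abstract (observations, claims, verification, evaluation, scoring) can be checked mechanically on a judge's output. The following is a minimal sketch, not the authors' implementation: the angle-bracket tag syntax and the function name `follows_chain` are hypothetical, and only the stage names and their order come from the abstract.

```python
import re

# Stage names and order come from the abstract; the <tag> syntax is an assumption.
STAGES = ["observations", "claims", "verification", "evaluation", "scoring"]


def follows_chain(text: str) -> bool:
    """Return True if each stage opens exactly once and the stages appear in chain order."""
    positions = []
    for stage in STAGES:
        # Find every opening tag for this stage; closing tags contain "/" and do not match.
        starts = [m.start() for m in re.finditer(rf"<{stage}>", text)]
        if len(starts) != 1:
            return False
        positions.append(starts[0])
    # Stages are in order when their opening positions are already sorted.
    return positions == sorted(positions)


well_formed = "".join(f"<{s}>...</{s}>" for s in STAGES)
out_of_order = "<scoring>...</scoring>" + well_formed.replace("<scoring>...</scoring>", "")
print(follows_chain(well_formed), follows_chain(out_of_order))
```

A check of this kind only validates the output format; it says nothing about whether the observations are actually grounded in the image.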

Paper Structure

This paper contains 13 sections, 5 equations, 11 figures, and 3 tables.

Figures (11)

  • Figure 1: MJ$_1$ grounded verification chain. Judgment scores are generated by verifying response claims against visual observations. Explicit visual grounding of the reasoning chain mitigates visual attention degradation.
  • Figure 2: Two-phase training pipeline. Cold-start SFT on distilled reasoning traces establishes format and basic judgment capability. GRPO then optimizes a composite reward that incentivizes both correctness and position invariance.
  • Figure 3: Computational structure comparison. (a) Standard judgment permits a shortcut path (dashed red) where scores depend minimally on images. (b) MJ$_1$ forces computation through observations $O$, claim extraction $C$, and verification $V$. The dashed arrow indicates the forced back-reference from verification to observations. The consistency reward $R_{\text{cons}}$ couples verification to scores, requiring coherent image-grounded reasoning.
  • Figure 4: Consistency as a grounding signal. (a) Mean $R_{\text{cons}}$ under three image conditions on an untrained base model. Shuffled images yield the lowest consistency, below even the no-image baseline. (b) Mean $R_{\text{correct}}$ shows degraded performance when visual grounding is disrupted, with both shuffled and blank conditions approaching random chance.
  • Figure 5: Cold-start SFT training loss.
  • ...and 6 more figures
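
The captions for Figures 2 to 4 refer to a composite reward built from a correctness term $R_{\text{correct}}$ and a position-consistency term $R_{\text{cons}}$. Their exact definitions are not given in this section, so the following is only an illustrative sketch under stated assumptions: binary rewards, an additive combination, and a hypothetical weight `cons_weight`. The judge is assumed to be queried twice, once with the responses in the original order and once with the order swapped.

```python
FLIP = {"A": "B", "B": "A"}


def correctness_reward(pref_original: str, gold: str) -> float:
    """1.0 if the position preferred in the original ordering matches the gold label."""
    return 1.0 if pref_original == gold else 0.0


def consistency_reward(pref_original: str, pref_swapped: str) -> float:
    """1.0 if the same underlying response wins after the two positions are swapped.

    Preferences are position labels ("A" or "B"). Swapping moves the response
    originally at A to position B, so a position-invariant judge flips its label.
    """
    return 1.0 if pref_swapped == FLIP[pref_original] else 0.0


def composite_reward(pref_original: str, pref_swapped: str, gold: str,
                     cons_weight: float = 0.5) -> float:
    """Assumed additive form: R_correct + cons_weight * R_cons (not the paper's stated formula)."""
    return (correctness_reward(pref_original, gold)
            + cons_weight * consistency_reward(pref_original, pref_swapped))


# A judge that always answers "A" is right on this example but position-biased.
print(composite_reward("A", "A", gold="A"))
# A judge that tracks the same response across both orderings earns the full reward.
print(composite_reward("A", "B", gold="A"))
```

Under these assumptions, a judge that always picks the first position can still collect the correctness term on some examples but never the consistency term, which is the behavior a position-bias penalty is meant to discourage.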