
Reinforcing Consistency in Video MLLMs with Structured Rewards

Yihao Quan, Zeru Shi, Jinman Zhao, Ruixiang Tang

Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding. However, seemingly plausible outputs often suffer from poor visual and temporal grounding: a model may fabricate object existence, assign incorrect attributes, or collapse repeated events while still producing a globally reasonable caption or answer. We study this failure mode through a compositional consistency audit that decomposes a caption into supporting factual and temporal claims, investigating whether a correct high-level prediction is actually backed by valid lower-level evidence. Our top-down audit reveals that even correct root relational claims often lack reliable attribute and existence support. This indicates that standard sentence-level supervision is a weak proxy for faithful video understanding. Furthermore, when turning to reinforcement learning (RL) for better alignment, standard sentence-level rewards often prove too coarse to accurately localize specific grounding failures. To address this, we replace generic sentence-level rewards with a structured reward built from factual and temporal units. Our training objective integrates three complementary components: (1) an instance-aware scene-graph reward for factual objects, attributes, and relations; (2) a temporal reward for event ordering and repetition; and (3) a video-grounded VQA reward for hierarchical self-verification. Across temporal, general video understanding, and hallucination-oriented benchmarks, this objective yields consistent gains on open-source backbones. These results suggest that structured reward shaping is a practical route to more faithful video understanding.



Figures (6)

  • Figure 1: Motivating observation. We evaluate a caption with a top-down compositional consistency audit. Starting from a root claim, we ask whether the supporting evidence beneath it is also correct. The audit set contains 200 human-created audit samples derived from AGQA-Decomp, with the root, attribute, and existence labels manually refined and verified.
  • Figure 2: Overview of our method. We compare the sampled caption and the reference caption in a shared structured space rather than at the whole-sentence level. After lightweight preprocessing, both captions are parsed into scene-graph elements, scored by revision-based factual, temporal, and VQA verification rewards, and then used to update the policy with REINFORCE plus KL regularization (a minimal sketch of this update follows the figure list).
  • Figure 3: Reward-component and training-stage ablations. Results are averaged within temporal understanding, conventional video understanding, and hallucination-oriented benchmark groups.
  • Figure 4: Scaling behavior across model size and training data size. We study two axes of scaling for our post-training recipe: backbone size and training data size. Results are averaged within temporal understanding, conventional video understanding, and hallucination-oriented benchmark groups.
  • Figure 5: Compact qualitative QA and verification comparison. The cases cover temporal QA, procedural MCQ, fine-grained MCQ, and open-ended scene-change reasoning. Across all four examples, the base model makes a locally plausible but poorly grounded error, whereas our method recovers the correct answer with less ambiguity.
  • ...and 1 more figure