When Thinking Drifts: Evidential Grounding for Robust Video Reasoning

Mi Luo, Zihui Xue, Alex Dimakis, Kristen Grauman

TL;DR

This work reveals a critical shortcoming of applying text-based chain-of-thought reasoning to video tasks: reasoning chains can drift away from visual evidence, producing plausible but incorrect inferences. It introduces Visual Evidence Reward (VER), an RL-based framework that anchors intermediate reasoning in verifiable visual facts via a judge and question-specific evidence prompts, culminating in the Video-VER model. Across 10 video benchmarks, Video-VER achieves top performance by grounding reasoning in visual input, significantly reducing hallucinations and improving temporal understanding. The study highlights the importance of grounding over verbosity for multimodal reasoning and provides a practical pathway to safer, more reliable video intelligence in large multimodal models.

Abstract

Video reasoning, the task of enabling machines to infer from dynamic visual content through multi-step logic, is crucial for advanced AI. While the Chain-of-Thought (CoT) mechanism has enhanced reasoning in text-based tasks, its application to video understanding remains underexplored. This paper presents a systematic analysis revealing that CoT often degrades performance in video reasoning, generating verbose but misleading internal monologues, and leading to hallucinated visual details and overridden correct intuitions - a phenomenon we term "visual thinking drift". We explain this drift through a Bayesian lens, positing that CoT traces often diverge from actual visual evidence, instead amplifying internal biases or language priors, causing models to storytell rather than engage in grounded reasoning. To counteract this, we introduce Visual Evidence Reward (VER), a novel reinforcement learning framework that explicitly rewards the generation of reasoning traces that are verifiably grounded in visual evidence. Comprehensive evaluation across 10 diverse video understanding benchmarks demonstrates that our Video-VER consistently achieves top performance. Our work sheds light on the distinct challenges of video-centric reasoning and encourages the development of AI that robustly grounds its inferences in visual evidence - for large multimodal models that not only "think before answering", but also "see while thinking".
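As a rough illustration of the idea behind VER, the sketch below scores a reasoning trace by whether it cites question-specific visual facts, and adds that score to an answer-accuracy reward. It is a minimal sketch under stated assumptions, not the paper's implementation: the `Rollout` type, the `keyword_judge` function (a toy substring matcher standing in for a real judge), the additive combination, and the `evidence_weight` value are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Rollout:
    """One sampled model response (hypothetical container)."""
    reasoning: str  # chain-of-thought text
    answer: str     # final answer, e.g. an option letter


def keyword_judge(reasoning: str, facts: Sequence[str]) -> bool:
    """Toy stand-in for a judge: True if the reasoning mentions any visual fact.

    Assumption: a real judge would assess grounding far more carefully;
    substring matching is used here only to keep the example self-contained.
    """
    text = reasoning.lower()
    return any(fact.lower() in text for fact in facts)


def evidence_reward(
    rollout: Rollout,
    facts: Sequence[str],
    judge: Callable[[str, Sequence[str]], bool] = keyword_judge,
) -> float:
    """Binary reward: 1.0 if the trace references visual evidence, else 0.0."""
    return 1.0 if judge(rollout.reasoning, facts) else 0.0


def total_reward(
    rollout: Rollout,
    gold_answer: str,
    facts: Sequence[str],
    evidence_weight: float = 0.5,  # assumed weight, not taken from the paper
    judge: Callable[[str, Sequence[str]], bool] = keyword_judge,
) -> float:
    """Accuracy reward plus a weighted evidence reward (assumed additive form)."""
    correct = rollout.answer.strip().upper() == gold_answer.strip().upper()
    accuracy = 1.0 if correct else 0.0
    return accuracy + evidence_weight * evidence_reward(rollout, facts, judge)


if __name__ == "__main__":
    facts = ["red cup", "opens the fridge"]
    grounded = Rollout("The person opens the fridge, then picks up a red cup.", "B")
    drifting = Rollout("People usually drink coffee in the morning, so it is a mug.", "B")
    print(total_reward(grounded, "B", facts))
    print(total_reward(drifting, "B", facts))
```

In this toy setup, two responses with the same correct answer receive different rewards depending on whether their reasoning cites the visual facts, which is the behavior the abstract describes as rewarding traces that are verifiably grounded in visual evidence.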

Paper Structure

This paper contains 18 sections, 5 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Two examples of the Visual Thinking Drift phenomenon, where the reasoning chain, as it grows longer, increasingly relies on hallucinated facts or incomplete temporal context, drawing conclusions from language patterns rather than grounding them in the actual video content.
  • Figure 2: Compared to directly prompting the model for an answer, instructing the model to "think before answering" leads to a noticeable performance drop in open-source MLLMs such as Qwen2.5-VL and Video-R1 across multiple benchmarks (10 are shown here).
  • Figure 3: Gains (green) and losses (pink) with a CoT prompt, showing that reasoning-driven generation is valuable for multi-hop, causal, or interpretability-driven tasks like object counting, but weakens both large and small models on lightweight perceptual questions such as scene transition detection.
  • Figure 4: Even with GPT-4o (a strong reasoning model), a considerable portion of questions (light blue areas) are better answered directly than with CoT reasoning, implying significant room for improvement in CoT reasoning. For VSI-Bench and MMVU, results are based on the MCQ subset.
  • Figure 5: Visualization of visual facts generated from the training data. Chain-of-thought responses that actively reference visual evidence are rewarded, while those that do not are given zero reward.
  • ...and 9 more figures