REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

Jiaze Li, Hao Yin, Wenhui Tan, Jingyang Chen, Boshen Xu, Yuxun Qu, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Jian Luan

TL;DR

The paper identifies the limits of text-based self-reflection for long-form video understanding and introduces REVISOR, a two-stage, tool-augmented multimodal reasoning framework that revisits visual segments during reflection. It pairs this framework with the Dual Attribution Decoupled Reward (DADR), which aligns reasoning with causal visual evidence under reinforcement learning and enables multimodal introspection without extra supervised fine-tuning. Across four benchmarks (VideoMME, LongVideoBench, MLVU, LVBench), REVISOR yields consistent accuracy gains, underscoring the importance of visual rethinking in long-form video tasks. The work advances multimodal introspection by formalizing visually grounded reflection and segment-level evidence sufficiency, with practical implications for building more reliable long-form video understanding systems.

Abstract

Self-reflection mechanisms that rely on purely text-based rethinking perform well in most multimodal tasks. However, when applied directly to long-form video understanding, they exhibit clear limitations, for two fundamental reasons: (1) long-form video understanding involves richer and more dynamic visual input, so rethinking only the textual information is insufficient and a further rethinking process that specifically targets visual information is needed; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, which prevents them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR learns, during reinforcement learning, to accurately review the video segments most relevant to the question, we design the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model's reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances the long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks: VideoMME, LongVideoBench, MLVU, and LVBench.
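The two-stage flow summarized in the TL;DR (an initial pass that proposes a segment to revisit, then a reflective pass over finer samples from that segment) can be illustrated with a minimal control-flow sketch. Everything here is an assumption made for illustration: the names `InitialOutput`, `uniform_timestamps`, and `two_stage_answer`, the use of timestamps as stand-ins for frames, and the uniform sampling scheme are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


@dataclass
class InitialOutput:
    """Hypothetical container for the stage-1 result."""
    reasoning: str    # preliminary reasoning trace
    segment: Segment  # segment the model asks to revisit


def uniform_timestamps(start: float, end: float, n: int) -> List[float]:
    """Return the midpoints of n equal-width bins covering [start, end]."""
    step = (end - start) / n
    return [start + step * (i + 0.5) for i in range(n)]


def two_stage_answer(
    question: str,
    duration: float,
    initial_model: Callable[[str, List[float]], InitialOutput],
    reflect_model: Callable[[str, str, List[float]], str],
    n_coarse: int = 8,
    n_fine: int = 8,
) -> str:
    # Stage 1: sparse samples over the whole video.
    coarse = uniform_timestamps(0.0, duration, n_coarse)
    first = initial_model(question, coarse)

    # Clamp the requested segment to the video; fall back to the full
    # video if the request is empty or inverted (an assumed policy).
    start = max(0.0, first.segment[0])
    end = min(duration, first.segment[1])
    if end <= start:
        start, end = 0.0, duration

    # Stage 2: denser samples inside the segment, plus the initial trace.
    fine = uniform_timestamps(start, end, n_fine)
    return reflect_model(question, first.reasoning, fine)
```

In this sketch the two callables stand in for the same MLLM prompted in two different modes; the point is only that the second call receives both the earlier reasoning and newly sampled visual input, rather than text alone.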

Paper Structure

This paper contains 23 sections, 8 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Operational workflow of the proposed REVISOR framework, contrasting it with traditional reflection mechanisms. The top panel illustrates a typical traditional approach, often employing a text-based re-evaluation mechanism. In contrast, the bottom panel details the REVISOR framework. This process involves two distinct stages: (1) Initial Inference, which generates a preliminary reasoning trace and identifies critical regions for detailed analysis; and (2) Reflective Reasoning, which integrates this initial trace with newly sampled, fine-grained visual evidence to yield a refined and robust final prediction.
  • Figure 2: Motivation for proposing a multimodal reflection mechanism. Left: Text-only reflection mechanisms, such as VL-Rethinker, achieve significant performance improvements in image understanding tasks. Middle: However, applying the same text-based reflection strategy to long-form video understanding leads to performance degradation. Right: Incorporating a revisit of key video segments during the reflection stage effectively improves performance on video understanding tasks.
  • Figure 3: Overview of the Dual-Attribution Decoupled Reward Mechanism (DADR). Final Answer Verification Reward (top) is derived from verifying the correctness of the model's synthesized final answer, directly targeting the accuracy objective of the reflective stage. Conversely, Causal Segment Sufficiency Reward (bottom) is granted upon verifying an attribution answer derived exclusively from reviewed video segments, thereby guiding the model to identify and utilize segments highly pertinent to the user query.
  • Figure 4: The superior efficacy of visual reflection over textual reflection in long-form video understanding. The left panel demonstrates that the length of the generated textual reflection consistently decreases throughout training. The right panel further indicates that forcing the model to perform longer textual reflection actually leads to a degradation in model performance.
  • Figure 5: Comparative accuracy of key moment review across different methods. Methods based on the REVISOR framework and its variants are highlighted in blue, while different Temporal Grounding baselines are represented in green.
  • ...and 8 more figures
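The two reward signals named in the Figure 3 caption, and their use inside GRPO, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's formulation: this section does not say how the two terms are weighted or decoupled, so the weighted sum in `combined_reward` and its default weights are hypothetical. The group-relative normalization follows the commonly described GRPO recipe (subtract the group mean, divide by the group standard deviation); using the population standard deviation and the `eps` constant are also assumptions.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(
    final_correct: bool,
    segment_only_correct: bool,
    w_final: float = 1.0,    # hypothetical weight
    w_segment: float = 1.0,  # hypothetical weight
) -> float:
    """Sum of two binary checks.

    final_correct: the synthesized final answer matches the ground truth.
    segment_only_correct: an answer produced from the reviewed segments
        alone matches the ground truth (a proxy for segment sufficiency).
    """
    return w_final * float(final_correct) + w_segment * float(segment_only_correct)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize the rewards of one sampled group to zero mean and unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled rollouts for one question (made-up outcomes).
rollouts = [(True, True), (True, False), (False, True), (False, False)]
rewards = [combined_reward(f, s) for f, s in rollouts]
advantages = group_relative_advantages(rewards)
```

Under this assumed combination, a rollout that answers correctly but picked an uninformative segment earns less than one that answers correctly from a sufficient segment, which is the behavior the caption attributes to the segment sufficiency term.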