ReaSon: Reinforced Causal Search with Information Bottleneck for Video Understanding

Yuan Zhou, Litao Hua, Shilong Jin, Wentao Huang, Haoran Duan

TL;DR

ReaSon addresses the challenge of efficient video understanding under token constraints by introducing a Causal Information Bottleneck (CIB) for keyframe selection. It jointly optimizes predictive sufficiency and causal necessity via a reinforced causal search framework, featuring two modules that identify frames informative for answering and causally decisive for reasoning, with a composite reward combining answer accuracy, cycle consistency, and counterfactual signals. The method demonstrates state-of-the-art results across NExT-QA, EgoSchema, and Video-MME under limited-frame budgets and generalizes across multiple vision-language models, validating both efficacy and robustness. This approach provides a principled framework for causal, data-efficient video understanding with practical impact for real-time or resource-constrained scenarios.

Abstract

Keyframe selection has become essential for video understanding with vision-language models (VLMs) due to limited input tokens and the temporal sparsity of relevant information across video frames. Video understanding often relies on effective keyframes that are not only informative but also causally decisive. To this end, we propose Reinforced Causal Search with Information Bottleneck (ReaSon), a framework that formulates keyframe selection as an optimization problem with the help of a novel Causal Information Bottleneck (CIB), which explicitly defines keyframes as those satisfying both predictive sufficiency and causal necessity. Specifically, ReaSon employs a learnable policy network to select keyframes from a visually relevant pool of candidate frames to capture predictive sufficiency, and then assesses causal necessity via counterfactual interventions. Finally, a composite reward aligned with the CIB principle is designed to guide the selection policy through reinforcement learning. Extensive experiments on NExT-QA, EgoSchema, and Video-MME demonstrate that ReaSon consistently outperforms existing state-of-the-art methods under limited-frame settings, validating its effectiveness and generalization ability.
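
The selection loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`sample_subset`, `composite_reward`, `reinforce_loss`), the reward weights, the without-replacement sampling scheme, and the way the counterfactual signal is computed (the drop in answer confidence when the selected frames are swapped for a counterfactual selection) are all assumptions made for illustration. In practice the logits would come from the policy network and the scores from a VLM.

```python
import math
import random


def sample_subset(logits, k, rng):
    """Sample k distinct candidate-frame indices without replacement.

    Frames are drawn one at a time from a softmax over the remaining
    candidates (an assumed scheme). Returns the chosen indices and the
    log-probability of that ordered draw.
    """
    remaining = list(range(len(logits)))
    chosen, log_prob = [], 0.0
    for _ in range(k):
        m = max(logits[i] for i in remaining)  # subtract max for stability
        weights = [math.exp(logits[i] - m) for i in remaining]
        total = sum(weights)
        r = rng.random() * total
        acc, pick = 0.0, len(remaining) - 1
        for pos, w in enumerate(weights):
            acc += w
            if r < acc:
                pick = pos
                break
        log_prob += math.log(weights[pick] / total)
        chosen.append(remaining.pop(pick))
    return chosen, log_prob


def composite_reward(answer_correct, cycle_score, p_selected,
                     p_counterfactual, weights=(1.0, 0.5, 0.5)):
    """Combine answer, cycle-consistency, and counterfactual signals.

    The weights and the form of the counterfactual term are hypothetical:
    here it is the confidence lost when the selection is replaced.
    """
    counterfactual_gain = p_selected - p_counterfactual
    w_ans, w_cyc, w_cf = weights
    return (w_ans * float(answer_correct)
            + w_cyc * cycle_score
            + w_cf * counterfactual_gain)


def reinforce_loss(log_prob, reward, baseline=0.0):
    """REINFORCE surrogate loss: minimizing it raises the probability
    of selections whose reward exceeds the baseline."""
    return -(reward - baseline) * log_prob


rng = random.Random(0)
frames, log_prob = sample_subset([0.1, 2.0, -1.0, 0.5], k=2, rng=rng)
reward = composite_reward(True, cycle_score=0.8,
                          p_selected=0.9, p_counterfactual=0.4)
loss = reinforce_loss(log_prob, reward, baseline=0.5)
```

In this sketch, a selection earns a high reward only if it supports the correct answer and removing it hurts the answer, which is one way to read the pairing of predictive sufficiency with causal necessity.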

Paper Structure

This paper contains 39 sections, 18 equations, 8 figures, 5 tables, 2 algorithms.

Figures (8)

  • Figure 1: Illustration of limitations of visual relevance and the importance of causal necessity in keyframe selection. The visual search method selects visually relevant frames (blue stars) but misses causally decisive frames (red stars). In contrast, reinforced causal search captures causally necessary frames, leading to more accurate answers.
  • Figure 2: Framework of the proposed ReaSon. (a) and (b) illustrate the predictive sufficiency and causal necessity modules, respectively, where a policy network learns to select keyframes based on CIB-aligned rewards. (c) shows the structural causal models. $Q$ and $V$ denote the question and the video, respectively. $F$ and $S$ represent selected frame subsets. $S'$ is a counterfactual selection used to assess causal necessity. $O$ denotes the output. Bottleneck variables are highlighted with orange circles.
  • Figure 3: Visualization of frame selection results, comparing our approach with the previous state-of-the-art method T*. Our approach pays less attention to irrelevant regions (in gray) and identifies more causally decisive keyframes.
  • Figure 4: The prompt template for visual elements, where <image> represents a PIL.Image object for each frame, and other angle-bracketed tokens are strings.
  • Figure 5: The prompt template for question answering.
  • ...and 3 more figures