Towards Sparse Video Understanding and Reasoning

Chenwei Xu, Zhen Ye, Shang Wu, Weijian Li, Zihan Wang, Zhuofan Xia, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Han Liu

TL;DR

Across multiple VQA benchmarks, ReViSe improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.

Abstract

We present ReViSe (Reasoning with Video Sparsity), a multi-round agent for video question answering (VQA). Instead of uniformly sampling frames, ReViSe selects a small set of informative frames, maintains a summary-as-state across rounds, and stops early when confident. It supports proprietary vision-language models (VLMs) in a "plug-and-play" setting and enables reinforcement fine-tuning for open-source models. For fine-tuning, we introduce EAGER (Evidence-Adjusted Gain for Efficient Reasoning), an annotation-free reward with three terms: (1) Confidence gain: after new frames are added, we reward the increase in the log-odds gap between the correct option and the strongest alternative; (2) Summary sufficiency: at answer time we re-ask using only the last committed summary and reward success; (3) Correct-and-early stop: answering correctly within a small turn budget is rewarded. Across multiple VQA benchmarks, ReViSe improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.
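To make the three EAGER terms concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the "log-odds gap" can be approximated as the log-probability margin between the correct option and the strongest alternative, that the three terms are combined as a weighted sum, and that the summary-sufficiency and early-stop terms are binary. The function names, the weights `w_gain`, `w_summary`, `w_stop`, and the `eps` smoothing constant are hypothetical and do not come from the abstract.

```python
import math
from typing import Sequence


def log_odds_gap(probs: Sequence[float], correct: int, eps: float = 1e-8) -> float:
    """Margin between the correct option and the strongest alternative.

    Assumption: the gap is measured as a difference of log-probabilities.
    """
    best_other = max(p for i, p in enumerate(probs) if i != correct)
    return math.log(probs[correct] + eps) - math.log(best_other + eps)


def eager_reward(
    probs_before: Sequence[float],   # option probabilities before new frames
    probs_after: Sequence[float],    # option probabilities after new frames
    correct: int,                    # index of the correct option
    summary_only_correct: bool,      # re-asking with only the summary succeeded
    answered_correct: bool,          # final answer was correct
    turns_used: int,
    turn_budget: int,
    w_gain: float = 1.0,             # hypothetical weights
    w_summary: float = 1.0,
    w_stop: float = 1.0,
) -> float:
    # (1) Confidence gain: increase in the gap after new frames are added.
    gain = log_odds_gap(probs_after, correct) - log_odds_gap(probs_before, correct)
    # (2) Summary sufficiency: success when re-asking with the summary alone.
    summary = 1.0 if summary_only_correct else 0.0
    # (3) Correct-and-early stop: correct answer within the turn budget.
    early = 1.0 if (answered_correct and turns_used <= turn_budget) else 0.0
    return w_gain * gain + w_summary * summary + w_stop * early
```

For example, moving from a uniform distribution over four options to `[0.7, 0.1, 0.1, 0.1]` with the correct option at index 0 gives a confidence gain of roughly `log(7)`. Whether negative gains should be clipped, and how the terms are actually scaled, is not specified in the abstract, so the sketch leaves the gain unclipped.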

Paper Structure (27 sections, 25 equations, 4 figures, 12 tables)


Figures (4)

  • Figure 1: Summary-as-State. ReViSe operates analogously to a recurrent neural network: it maintains a state that propagates information from previous turns to the VLM.
  • Figure 2: Overview of ReViSe: multi-round reasoning and adaptive frame selection. Given an initial set of frames and a question, the VLM agent infers the video context to update the summary and selects relevant frames based on its reasoning. In the next round, the agent reasons over the selected frames and the updated summary to generate the final answer.
  • Figure 3: ReViSe. ReViSe consists of three components: multi-round conversation, a structured output protocol, and a summary-as-state. Each round, the VLM agent receives (i) the entire conversation history, (ii) the current prompt, and (iii) the chosen video frames, annotated with their timestamps and the video's total frame count. In the first round, a formatting guideline is also provided. The VLM outputs <summary> plus either <frames> (request) or <answer> (final), with the summary carrying the persistent state across rounds. ReViSe then extracts the new frames and starts the next round with the updated prompt and history. The conversation ends when the VLM produces a valid answer or the maximum number of rounds is reached.
  • Figure 4: Accuracy-frames Pareto frontier. Each point corresponds to a different frame budget $N$, and the frontier is monotone.
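
The round structure described in the Figure 3 caption can be sketched as a small parse-and-loop routine. This is an illustrative sketch under several assumptions, not the paper's code: it assumes each tag has a matching closing tag (for example `</summary>`), that frame requests are written as integer indices, and that a malformed reply simply consumes a round. For brevity it passes only the summary between rounds and omits the full conversation history, timestamps, and the first-round formatting guideline. The names `Turn`, `parse_turn`, `run_episode`, and the `vlm` callable signature are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Turn:
    summary: str
    frames: Optional[List[int]] = None  # set when the model requests frames
    answer: Optional[str] = None        # set when the model gives a final answer


def _tag(text: str, name: str) -> Optional[str]:
    """Return the stripped content of <name>...</name>, or None if absent."""
    m = re.search(rf"<{name}>(.*?)</{name}>", text, re.DOTALL)
    return m.group(1).strip() if m else None


def parse_turn(text: str) -> Optional[Turn]:
    """Parse one reply: <summary> plus exactly one of <frames> or <answer>."""
    summary = _tag(text, "summary")
    frames = _tag(text, "frames")
    answer = _tag(text, "answer")
    if summary is None or (frames is None) == (answer is None):
        return None  # malformed: missing summary, or both/neither action tags
    if answer is not None:
        return Turn(summary, answer=answer)
    return Turn(summary, frames=[int(t) for t in re.findall(r"\d+", frames)])


def run_episode(
    vlm: Callable[[str, List[int], str], str],  # (question, frames, summary) -> reply
    question: str,
    initial_frames: List[int],
    max_rounds: int = 4,
) -> Tuple[Optional[str], str]:
    summary, frames = "", list(initial_frames)
    for _ in range(max_rounds):
        turn = parse_turn(vlm(question, frames, summary))
        if turn is None:
            continue  # assumption: a malformed reply uses up one round
        summary = turn.summary  # the summary is the state carried forward
        if turn.answer is not None:
            return turn.answer, summary
        frames = turn.frames
    return None, summary  # round limit reached without a valid answer
```

Here `vlm` stands in for any model call; a real integration would also need to attach the requested frames as images and maintain the chat history that the caption mentions.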