VideoBrain: Learning Adaptive Frame Sampling for Long Video Understanding

Junbo Zou, Ziheng Huang, Shengjie Zhang, Liwen Zhang, Weining Shen

TL;DR

VideoBrain tackles the challenge of understanding long-form videos under compute constraints by introducing an end-to-end vision-language framework that adaptively samples frames through dual agents: a CLIP-based semantic retrieval agent and a Uniform temporal sampling agent. The model learns when to invoke these agents via a two-stage training pipeline (supervised fine-tuning followed by reinforcement learning with a behavior-aware reward) and a data-categorization scheme (Direct, Adaptive, Active) to prevent reward hacking. Empirical results across four long-video benchmarks show consistent improvements (+3.5% to +9.0%) with 30-40% fewer frames and strong generalization to short-video tasks. The approach advances end-to-end adaptive perception for long-duration video understanding and offers a scalable, efficient solution for real-world VLM workloads.

Abstract

Long-form video understanding remains challenging for Vision-Language Models (VLMs) due to the inherent tension between computational constraints and the need to capture information distributed across thousands of frames. Existing approaches either sample frames uniformly (risking information loss) or select keyframes in a single pass (with no recovery from poor choices). We propose VideoBrain, an end-to-end framework that enables VLMs to adaptively acquire visual information through learned sampling policies. Our approach features dual complementary agents: a CLIP-based agent for semantic retrieval across the video and a Uniform agent for dense temporal sampling within intervals. Unlike prior agent-based methods that rely on text-only LLMs orchestrating visual tools, our VLM directly perceives frames and reasons about information sufficiency. To prevent models from invoking agents indiscriminately to maximize rewards, we introduce a behavior-aware reward function coupled with a data classification pipeline that teaches the model when agent invocation is genuinely beneficial. Experiments on four long video benchmarks demonstrate that VideoBrain achieves +3.5% to +9.0% improvement over the baseline while using 30-40% fewer frames, with strong cross-dataset generalization to short video benchmarks.
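The two sampling agents described above can be illustrated with a minimal sketch. The names `uniform_sample` and `clip_sample` follow the tool names in the system prompt (Figure 3), but the signatures, the use of precomputed embedding vectors, and the cosine-similarity top-k ranking are assumptions made for illustration only, not the authors' implementation.

```python
import math
from typing import List, Sequence


def uniform_sample(start: int, end: int, num_frames: int) -> List[int]:
    """Return up to num_frames frame indices spread evenly over [start, end].

    Assumed interface: the interval is given in frame indices, inclusive.
    """
    if num_frames <= 0 or end < start:
        return []
    if num_frames == 1:
        return [(start + end) // 2]
    step = (end - start) / (num_frames - 1)
    # Round to the nearest frame; a short interval can yield duplicates,
    # so deduplicate and return the indices in temporal order.
    indices = {start + round(i * step) for i in range(num_frames)}
    return sorted(indices)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def clip_sample(
    query_emb: Sequence[float],
    frame_embs: Sequence[Sequence[float]],
    top_k: int,
) -> List[int]:
    """Return indices of the top_k candidate frames most similar to the query.

    Assumption: query_emb and frame_embs are embeddings already produced by
    a CLIP-style text and image encoder; this sketch only does the ranking.
    """
    scores = [cosine(query_emb, emb) for emb in frame_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Return the selected frames in temporal order.
    return sorted(ranked[:top_k])
```

In this sketch the uniform agent needs only an interval, while the CLIP-style agent needs a text query and a pool of candidate frame embeddings. That matches the division of labor in the abstract: dense temporal coverage within an interval versus semantic search across the video.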

Paper Structure

This paper contains 65 sections, 3 equations, 11 figures, 5 tables, and 1 algorithm.

Figures (11)

  • Figure 1: VideoBrain framework and two complementary sampling agents. Top: The model takes video frames and a question as input, reasons about information sufficiency, and selectively invokes agents to gather additional frames before producing a final answer. Bottom left: The CLIP Sample Agent searches semantically across 256 candidate frames to retrieve visually relevant content (e.g., finding a specific scene described in the question). Bottom right: The Uniform Sample Agent densely samples within a temporal interval to capture fine-grained sequential information (e.g., observing what happens between two events).
  • Figure 2: Overview of the VideoBrain training framework. Top: Dual-model evaluation classifies video QA samples into Direct, Adaptive, and Active categories based on whether agent invocation improves performance. Bottom: SFT teaches thinking and agent invocation. During RL, the model iteratively reasons and calls sampling agents to gather frames, with behavior-aware rewards encouraging efficiency on Direct questions and exploration on Active ones.
  • Figure 3: System prompt for VideoBrain inference. The agent is instructed to use uniform_sample for temporal understanding and clip_sample for semantic retrieval.
  • Figure 4: User prompt template for the initial turn.
  • Figure 5: Turn prompt template for subsequent turns after agent invocation. The sampled frames are appended with their frame indices.
  • ...and 6 more figures
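
The Figure 2 caption describes sorting QA samples into Direct, Adaptive, and Active categories and then shaping the RL reward by category. The sketch below shows one way such a scheme could look. The accuracy-gain thresholds, the bonus value, and the function names `categorize` and `behavior_aware_reward` are all hypothetical; the page excerpt here does not give the paper's actual rules or reward formula.

```python
from enum import Enum


class Category(Enum):
    DIRECT = "direct"      # initial frames suffice; agent calls do not help
    ADAPTIVE = "adaptive"  # agent calls help somewhat
    ACTIVE = "active"      # agent calls clearly help


def categorize(
    acc_without_agents: float,
    acc_with_agents: float,
    low: float = 0.05,
    high: float = 0.3,
) -> Category:
    """Assign a category from the accuracy gain due to agent invocation.

    Hypothetical rule: the two accuracies come from evaluating a sample
    with and without agent access; `low` and `high` are made-up thresholds.
    """
    gain = acc_with_agents - acc_without_agents
    if gain >= high:
        return Category.ACTIVE
    if gain <= low:
        return Category.DIRECT
    return Category.ADAPTIVE


def behavior_aware_reward(
    category: Category,
    correct: bool,
    num_agent_calls: int,
    bonus: float = 0.2,
) -> float:
    """Correctness reward plus a category-dependent behavior bonus.

    Hypothetical shaping: the bonus is paid only on correct answers, for
    skipping agents on Direct samples or for using them on Active samples.
    """
    if not correct:
        return 0.0
    reward = 1.0
    if category is Category.DIRECT and num_agent_calls == 0:
        reward += bonus  # encourage efficiency
    elif category is Category.ACTIVE and num_agent_calls > 0:
        reward += bonus  # encourage exploration
    return reward
```

Because the bonus in this sketch depends on both the category and answer correctness, calling agents on every question earns nothing extra on Direct samples. This is the kind of indiscriminate invocation the abstract says the reward is meant to discourage.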