Generative Frame Sampler for Long Video Understanding

Linli Yao, Haoning Wu, Kun Ouyang, Yuanxing Zhang, Caiming Xiong, Bei Chen, Xu Sun, Junnan Li

TL;DR

The paper tackles the bottleneck of understanding hours-long videos with VideoLLMs by proposing GenS, a generative frame sampler that identifies instruction-relevant frames to feed into VideoQA backbones. It introduces GenS-Video-150K, a densely annotated dataset built in multiple stages for grounding frames with confidence scores, and a MoE-based GenS architecture built on Aria that supports efficient, flexible frame retrieval across varying input FPS. In experiments on LongVideoBench, MLVU, and HourVideo, GenS delivers consistent improvements across open-source and proprietary VideoLLMs, achieving state-of-the-art results on multiple benchmarks and demonstrating strong temporal grounding capabilities. The work also explores extensions such as coarse-to-fine hybrid sampling and dataset combinations, providing practical guidance for deploying GenS in real-world long-form video understanding tasks. Overall, GenS advances efficient long-form video perception by turning frame selection into a learned, instruction-aware retrieval process that complements existing VideoLLMs.

Abstract

Despite recent advances in Video Large Language Models (VideoLLMs), effectively understanding long-form videos remains a significant challenge. Perceiving lengthy videos containing thousands of frames poses a substantial computational burden. To mitigate this issue, this paper introduces Generative Frame Sampler (GenS), a plug-and-play module integrated with VideoLLMs to facilitate efficient perception of lengthy videos. Built upon a lightweight VideoLLM, GenS leverages its inherent vision-language capabilities to identify question-relevant frames. To facilitate effective retrieval, we construct GenS-Video-150K, a large-scale video instruction dataset with dense frame relevance annotations. Extensive experiments demonstrate that GenS consistently boosts the performance of various VideoLLMs, including open-source models (Qwen2-VL-7B, Aria-25B, VILA-40B, LLaVA-Video-7B/72B) and proprietary assistants (GPT-4o, Gemini). When equipped with GenS, open-source VideoLLMs achieve state-of-the-art results on long-form video benchmarks: LLaVA-Video-72B reaches 66.8 (+4.3) on LongVideoBench and 77.0 (+2.7) on MLVU, while Aria obtains 39.2 on HourVideo, surpassing Gemini-1.5-pro by 1.9 points. We will release all datasets and models at https://generative-sampler.github.io.
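
To make the plug-and-play idea concrete, the following is a minimal sketch of how a downstream VideoQA model's frame budget might be filled from a sampler's output. It is not the authors' implementation: the output format (a mapping from frame index to a relevance score) and the function name `select_frames` are assumptions made for illustration only.

```python
from typing import Dict, List


def select_frames(relevance: Dict[int, float], max_frames: int) -> List[int]:
    """Pick the highest-scoring frames and return them in temporal order.

    `relevance` maps a frame index to a relevance score (hypothetical
    format, assumed here for illustration). Ties are broken in favor of
    the earlier frame.
    """
    if max_frames <= 0:
        return []
    # Rank by descending score, then by ascending frame index for ties.
    ranked = sorted(relevance.items(), key=lambda kv: (-kv[1], kv[0]))
    # Restore temporal order so the downstream model sees frames in sequence.
    return sorted(idx for idx, _ in ranked[:max_frames])


# Toy scores for five candidate frames (made-up values).
scores = {0: 0.1, 12: 0.9, 13: 0.8, 40: 0.3, 41: 0.95}
print(select_frames(scores, max_frames=3))  # [12, 13, 41]
```

Returning the indices in temporal order rather than score order is a design choice in this sketch: it keeps the selected frames usable as an ordinary frame sequence for whatever VideoLLM consumes them.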

Paper Structure

This paper contains 42 sections, 4 figures, 15 tables.

Figures (4)

  • Figure 1: (a) An example of long video question-answering (VideoQA) using different frame samplers. Our Generative Frame Sampler (GenS) accurately identifies relevant frame sequences based on the user question, further enhancing the performance of the downstream VideoQA assistant. (b) VideoQA accuracy results of state-of-the-art VideoQA assistants (Aria and GPT-4o) when equipped with different frame samplers on the Vision-Centric subset of LongVideoBench.
  • Figure 2: Ablation study on different input and output frame indexing formats.
  • Figure 3: Visualization of GenS integrated with VILA-v1.5-40B (≤14 frames) on the MLVU dataset.
  • Figure 4: Visualization of an annotated data sample from GenS-Video-150K.