Generative Frame Sampler for Long Video Understanding
Linli Yao, Haoning Wu, Kun Ouyang, Yuanxing Zhang, Caiming Xiong, Bei Chen, Xu Sun, Junnan Li
TL;DR
The paper tackles the bottleneck of understanding hours-long videos with VideoLLMs by proposing GenS, a generative frame sampler that identifies instruction-relevant frames and feeds them to VideoQA backbones. It introduces GenS-Video-150K, a densely annotated dataset built in multiple stages for grounding frames with confidence scores, and a MoE-based GenS architecture built on Aria that enables efficient, flexible frame retrieval across varying input FPS. In extensive experiments on LongVideoBench, MLVU, and HourVideo, GenS delivers consistent improvements across open-source and proprietary VideoLLMs, achieving state-of-the-art results on multiple benchmarks and demonstrating strong temporal grounding capabilities. The work also explores extensions such as coarse-to-fine hybrid sampling and dataset combinations, providing practical guidance for deploying GenS in real-world long-form video understanding tasks. Overall, GenS advances efficient long-form video perception by turning frame selection into a learned, instruction-aware retrieval process that complements existing VideoLLMs.
Abstract
Despite recent advances in Video Large Language Models (VideoLLMs), effectively understanding long-form videos remains a significant challenge. Perceiving lengthy videos containing thousands of frames poses a substantial computational burden. To mitigate this issue, this paper introduces Generative Frame Sampler (GenS), a plug-and-play module integrated with VideoLLMs to facilitate efficient perception of lengthy videos. Built upon a lightweight VideoLLM, GenS leverages its inherent vision-language capabilities to identify question-relevant frames. To facilitate effective retrieval, we construct GenS-Video-150K, a large-scale video instruction dataset with dense frame relevance annotations. Extensive experiments demonstrate that GenS consistently boosts the performance of various VideoLLMs, including open-source models (Qwen2-VL-7B, Aria-25B, VILA-40B, LLaVA-Video-7B/72B) and proprietary assistants (GPT-4o, Gemini). When equipped with GenS, open-source VideoLLMs achieve state-of-the-art results on long-form video benchmarks: LLaVA-Video-72B reaches 66.8 (+4.3) on LongVideoBench and 77.0 (+2.7) on MLVU, while Aria obtains 39.2 on HourVideo, surpassing Gemini-1.5-pro by 1.9 points. We will release all datasets and models at https://generative-sampler.github.io.
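To make the plug-and-play idea concrete, the following is a minimal sketch of the step that would sit between a frame sampler and a downstream VideoLLM. It is not the GenS implementation. It assumes the sampler has already produced a relevance score per candidate frame; the function name `select_frames`, the dictionary format, and the example scores are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def select_frames(relevance: Dict[int, float], budget: int) -> List[int]:
    """Keep the `budget` highest-scoring frame indices, in temporal order.

    `relevance` maps a frame index to a confidence score. This format is an
    assumption for illustration, not the output format of GenS.
    """
    if budget <= 0:
        return []
    # Rank by score (descending); break ties in favor of the earlier frame.
    ranked = sorted(relevance, key=lambda idx: (-relevance[idx], idx))
    # Restore temporal order so the downstream model sees frames in sequence.
    return sorted(ranked[:budget])


# Hypothetical scores for one question over five candidate frames.
scores = {0: 0.1, 30: 0.9, 60: 0.4, 90: 0.8, 120: 0.05}
print(select_frames(scores, budget=3))  # [30, 60, 90]
```

Only the selected frames would then be passed to the VideoQA model, which is how a sampler of this kind reduces the number of frames the larger model has to process.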
