Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?

Tianyuan Qu; Longxiang Tang; Bohao Peng; Senqiao Yang; Bei Yu; Jiaya Jia

Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?

Tianyuan Qu, Longxiang Tang, Bohao Peng, Senqiao Yang, Bei Yu, Jiaya Jia

TL;DR

This work addresses the sampling dilemma in long-video vision-language understanding by introducing the concept of Necessary Sampling Density (NSD) and presenting LSDBench to stress high-NSD scenarios. It proposes a two-stage, reasoning-driven hierarchical sampling (RHS) framework that first reasons over sparse keyframes to locate informative segments and then densely samples those segments for inference. A lightweight Semantic-Guided Frame Selector (SGFS) further improves efficiency by selecting diverse, informative frames without relying on the question prompt. Through an automated QA generation pipeline and comprehensive experiments, the approach demonstrates improved sampling efficiency and competitive accuracy with substantially fewer frames, establishing a new benchmark and methodology for long-video understanding in LVLMs.

Abstract

The rise of Large Vision-Language Models (LVLMs) has significantly advanced video understanding. However, efficiently processing long videos remains a challenge due to the ``Sampling Dilemma'': low-density sampling risks missing critical information, while high-density sampling introduces redundancy. To address this issue, we introduce LSDBench, the first benchmark designed to evaluate LVLMs on long-video tasks by constructing high Necessary Sampling Density (NSD) questions, where NSD represents the minimum sampling density required to accurately answer a given question. LSDBench focuses on dense, short-duration actions to rigorously assess the sampling strategies employed by LVLMs. To tackle the challenges posed by high-NSD questions, we propose a novel Reasoning-Driven Hierarchical Sampling (RHS) framework, which combines global localization of question-relevant cues with local dense sampling for precise inference. Additionally, we develop a lightweight Semantic-Guided Frame Selector to prioritize informative frames, enabling RHS to achieve comparable or superior performance with significantly fewer sampled frames. Together, our LSDBench and RHS framework address the unique challenges of high-NSD long-video tasks, setting a new standard for evaluating and improving LVLMs in this domain. Our benchmark and evaluation codes has been released at: https://github.com/dvlab-research/LSDBench

Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?

TL;DR

Abstract

Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (8)