Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?

Tianyuan Qu, Longxiang Tang, Bohao Peng, Senqiao Yang, Bei Yu, Jiaya Jia

TL;DR

This work addresses the sampling dilemma in long-video vision-language understanding by introducing the concept of Necessary Sampling Density (NSD) and presenting LSDBench to stress high-NSD scenarios. It proposes a two-stage, reasoning-driven hierarchical sampling (RHS) framework that first reasons over sparse keyframes to locate informative segments and then densely samples those segments for inference. A lightweight Semantic-Guided Frame Selector (SGFS) further improves efficiency by selecting diverse, informative frames without relying on the question prompt. Through an automated QA generation pipeline and comprehensive experiments, the approach demonstrates improved sampling efficiency and competitive accuracy with substantially fewer frames, establishing a new benchmark and methodology for long-video understanding in LVLMs.
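The two-stage idea can be illustrated with a minimal sketch that works on timestamps only. This is not the authors' implementation: the `locate` callback is a hypothetical stand-in for the LVLM reasoning step that picks informative segments from the sparse frames, and the bin-center sampling and proportional budget split are assumptions made for the example.

```python
from typing import Callable, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def uniform_timestamps(start: float, end: float, n: int) -> List[float]:
    """Return n timestamps at the centers of n equal bins covering [start, end]."""
    if n <= 0 or end <= start:
        return []
    step = (end - start) / n
    return [start + (i + 0.5) * step for i in range(n)]


def hierarchical_sample(
    duration: float,
    n_sparse: int,
    n_dense: int,
    locate: Callable[[List[float]], List[Segment]],
) -> List[float]:
    """Stage 1: sparse global pass. Stage 2: dense pass inside located segments."""
    sparse = uniform_timestamps(0.0, duration, n_sparse)
    # Hypothetical reasoning step: maps sparse frame times to candidate segments.
    segments = [(s, e) for s, e in locate(sparse) if e > s]
    if not segments:
        return sparse  # nothing located, so fall back to the sparse frames
    total = sum(e - s for s, e in segments)
    dense: List[float] = []
    for s, e in segments:
        # Assumed policy: split the dense budget in proportion to segment length.
        share = max(1, round(n_dense * (e - s) / total))
        dense.extend(uniform_timestamps(s, e, share))
    return sorted(dense)
```

For example, with a one-hour video and a locator that returns the single segment from 600 s to 660 s, `hierarchical_sample(3600.0, 4, 6, lambda sparse: [(600.0, 660.0)])` places all six dense frames inside that minute rather than spreading them over the full hour.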

Abstract

The rise of Large Vision-Language Models (LVLMs) has significantly advanced video understanding. However, efficiently processing long videos remains a challenge due to the "Sampling Dilemma": low-density sampling risks missing critical information, while high-density sampling introduces redundancy. To address this issue, we introduce LSDBench, the first benchmark designed to evaluate LVLMs on long-video tasks by constructing high Necessary Sampling Density (NSD) questions, where NSD represents the minimum sampling density required to accurately answer a given question. LSDBench focuses on dense, short-duration actions to rigorously assess the sampling strategies employed by LVLMs. To tackle the challenges posed by high-NSD questions, we propose a novel Reasoning-Driven Hierarchical Sampling (RHS) framework, which combines global localization of question-relevant cues with local dense sampling for precise inference. Additionally, we develop a lightweight Semantic-Guided Frame Selector to prioritize informative frames, enabling RHS to achieve comparable or superior performance with significantly fewer sampled frames. Together, our LSDBench and RHS framework address the unique challenges of high-NSD long-video tasks, setting a new standard for evaluating and improving LVLMs in this domain. Our benchmark and evaluation code have been released at: https://github.com/dvlab-research/LSDBench
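To give a feel for why short-duration actions push the required density up, the sketch below computes a simple worst-case bound. It is an illustration only, not the paper's NSD definition: it assumes frames are sampled uniformly at bin centers over the whole video and asks how many frames are needed so that any segment of a given length is guaranteed to contain a chosen number of them. The function name and the `frames_needed` parameter are hypothetical.

```python
import math


def min_uniform_frames(video_sec: float, segment_sec: float, frames_needed: int = 1) -> int:
    """Smallest uniform frame count that guarantees `frames_needed` frames
    inside every segment of length `segment_sec` (worst-case placement)."""
    if video_sec <= 0 or segment_sec <= 0 or frames_needed <= 0:
        raise ValueError("all arguments must be positive")
    # With spacing video_sec / N, any segment of length d holds at least
    # floor(d * N / video_sec) samples, so N must reach frames_needed * video_sec / d.
    return math.ceil(frames_needed * video_sec / segment_sec)


# A 5-second action in a 1-hour video needs 720 uniform frames
# before at least one frame is guaranteed to land inside it.
print(min_uniform_frames(3600, 5))
```

Under these assumptions, a budget of a few dozen uniformly spaced frames cannot guarantee coverage of a brief action in an hour-long video, which is the low-density side of the dilemma the abstract describes.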

Paper Structure

This paper contains 17 sections, 4 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of our four-stage data annotation pipeline. We first segment the video into one-minute intervals and generate captions for each segment. These captions are then hierarchically clustered: the first layer clusters by scenes, while the second layer clusters by actions or events. Once the hierarchical tree structure is established, we use GPT-4o to generate questions, which are subsequently filtered and optimized through a two-round refinement process. To generate free-form answers to the questions, we use Gemini 2.0 Flash with narration-based annotations as auxiliary input. Finally, we construct multiple-choice options iteratively in an adversarial manner using GPT-4o.
  • Figure 2: Dataset statistics visualization.
  • Figure 3: Overview of our Reasoning-Driven Hierarchical Sampling (RHS) framework.
  • Figure 4: The line graph illustrates the relationship between the number of sampled frames (x-axis) and accuracy on LSDBench (y-axis). Solid lines represent results under the Full Video setting, while dashed lines with inverted triangles correspond to the Oracle setting. The gap between the Oracle and global uniform sampling highlights the potential for improved sampling strategies in long-video VLMs.
  • Figure 5: Correlation between overall accuracy (y-axis) and the coverage rate of the first-stage predictions on the ground truth target segments (x-axis).
  • ...and 3 more figures