APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval

Hong Gao, Yiming Bao, Xuezhen Tu, Bin Zhong, Linan Yue, Minling Zhang

TL;DR

APVR addresses hour-level video understanding by introducing a training-free, dual-granularity retrieval framework that hierarchically preserves salient visual information. It combines Pivot Frame Retrieval, which expands queries into objects, descriptions, relations, and semantics and scores frames via CLIP and Grounding-DINO, with Pivot Token Retrieval, which applies query-aware cross-layer attention and adaptive token selection to keep a compact, highly relevant token set. The framework uses iterative adaptive resampling and temporal diffusion to maintain temporal coherence while staying within memory constraints, enabling processing of up to $K=1024$ frames for hour-long videos. Empirical results on LongVideoBench, VideoMME, and MLVU show state-of-the-art performance among both training-free and training-based methods, with notable improvements (e.g., up to $9.7\%$) over baselines across multiple MLLMs. APVR’s plug-and-play design and training-free nature offer a scalable, model-agnostic path to robust long-form video understanding.

Abstract

Current multimodal large language models (MLLMs) struggle with hour-level video understanding, facing significant challenges not only in modeling the substantial information volume of long videos but also in overcoming the memory wall and resource constraints during both training and inference. Although recent training-free approaches have alleviated resource demands by compressing visual features, their reliance on incomplete visual information limits their performance potential. To address these limitations, we propose Adaptive Pivot Visual information Retrieval (APVR), a training-free framework that hierarchically retrieves and retains sufficient and important visual information. It breaks through the memory wall via two complementary components: Pivot Frame Retrieval employs query expansion and iterative spatio-semantic confidence scoring to identify relevant video frames, and Pivot Token Retrieval performs query-aware, attention-driven token selection within up to 1024 pivot frames. This dual-granularity approach enables the processing of hour-long videos while maintaining semantic fidelity. Experiments on three different baseline MLLMs demonstrate significant performance improvements of up to 9.5%, 4.6%, and 9.7% on LongVideoBench, VideoMME, and MLVU, respectively. APVR achieves state-of-the-art results among both training-free and training-based approaches.
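The dual-granularity idea described above (first keep a bounded set of pivot frames, then keep a subset of tokens inside each kept frame) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: in APVR the frame scores come from query expansion with CLIP and Grounding-DINO and the token scores from query-aware cross-layer attention, whereas here both are random placeholders. The function names `select_pivot_frames` and `select_pivot_tokens`, the `keep_ratio` parameter, and the figure of 196 tokens per frame are all assumptions made for illustration, and a fixed ratio stands in for the paper's adaptive token selection.

```python
import numpy as np


def select_pivot_frames(frame_scores, k):
    """Return indices of the k highest-scoring frames, in temporal order."""
    if k <= 0:
        raise ValueError("k must be positive")
    k = min(k, len(frame_scores))
    top = np.argsort(frame_scores)[-k:]  # indices of the k largest scores
    return np.sort(top)                  # restore temporal (index) order


def select_pivot_tokens(token_scores, keep_ratio):
    """For each frame (row), keep the top fraction of tokens by score.

    Returns an integer array of shape (num_frames, n_keep) whose rows hold
    the kept token indices in their original order.
    """
    n_tokens = token_scores.shape[1]
    n_keep = max(1, int(round(n_tokens * keep_ratio)))
    top = np.argsort(token_scores, axis=1)[:, -n_keep:]
    return np.sort(top, axis=1)


# Placeholder scores (assumption): a real system would compute these
# from the query and the video, not draw them at random.
rng = np.random.default_rng(0)
frame_scores = rng.random(3600)                       # one score per sampled frame
pivot_frames = select_pivot_frames(frame_scores, k=1024)

token_scores = rng.random((len(pivot_frames), 196))   # one score per visual token
pivot_tokens = select_pivot_tokens(token_scores, keep_ratio=0.25)

print(pivot_frames.shape, pivot_tokens.shape)
```

Sorting the selected indices keeps frames and tokens in their original order, which matters because the downstream MLLM still consumes them as a temporally and spatially ordered sequence.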

Paper Structure

This paper contains 25 sections, 20 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Given a long video, uniform sampling (a) and insufficient frame retrieval (b) yield incorrect answers, while naively increasing the number of pivot frames (c) causes out-of-memory (OOM) errors. Our APVR (d) explores joint frame-token co-retrieval, concentrating computational resources on the most relevant information.
  • Figure 2: The overall framework of our proposed APVR. We integrate two plug-and-play components, Pivot Frame Retrieval and Pivot Token Retrieval, into MLLMs to improve the performance of long video understanding. APVR is training-free and provides an alternative to parameter scaling. With frame-level and token-level adaptive selection, it can accurately understand complex videos with computational efficiency.
  • Figure 3: Qualitative comparison of APVR with the baseline MLLM. The expanded query significantly complements pivot information retrieval. The number and score of each pivot frame are drawn in the top-left corner in green and red text, respectively. Yellow stars indicate the key frames for a correct response. Green answers are correct, while red ones are wrong.
  • Figure 4: Results of the ablation study on LongVideoBench for the two designed components: Pivot Frame Retrieval (PFR) and Pivot Token Retrieval (PTR).
  • Figure 5: The prompt template for query expansion.