
Small Vision-Language Models are Smart Compressors for Long Video Understanding

Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou, Wei Wen, Junlin Han, Mingchen Zhuge, Saksham Suri, Qi Qian, Shuming Liu, Lemeng Wu, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Chenchen Zhu

Abstract

Adapting Multimodal Large Language Models (MLLMs) to hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, such as sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework that compresses long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process that generates compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free $O(1)$ dynamic router: it allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors that preserve the global storyline. Extensive experiments show that our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extremely long LVBench (4101 s average), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro; scaling to 2048 frames raises this to 53.7. Crucially, Tempo compresses hour-long videos to token counts far below its theoretical budget, demonstrating that true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
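To make ATA's routing step concrete, the following is a minimal sketch of a budget-constrained allocator of this kind. It assumes per-segment relevance scores are already available (in Tempo they come from the SVLM's intercepted hidden state); the function name `ata_allocate`, the softmax weighting, and the floor/ceiling constants are illustrative assumptions, not the paper's exact procedure. With 16-frame segments, for instance, a floor of 8 tokens per segment would correspond to the ~0.5 tokens/frame anchors mentioned above.

```python
# Minimal sketch of budget-constrained adaptive token allocation.
# All names and constants here are illustrative, not Tempo's actual API.
import numpy as np

def ata_allocate(scores, budget, min_anchor=8, max_dense=256):
    """Split a global visual-token budget across video segments by relevance.

    scores:     per-segment zero-shot relevance scores (higher = more relevant).
    budget:     total number of visual tokens the global decoder may receive.
    min_anchor: floor per segment, i.e. the minimal "temporal anchor" that
                keeps the global storyline intact.
    max_dense:  ceiling per segment, i.e. the dense bandwidth granted to
                query-critical segments.
    """
    scores = np.asarray(scores, dtype=np.float64)
    n = scores.shape[0]
    # Every segment first receives its anchor so causality survives compression.
    alloc = np.full(n, min_anchor, dtype=np.int64)
    spare = budget - alloc.sum()
    if spare <= 0:
        # Degenerate case: anchors alone exhaust (or exceed) the budget.
        return alloc
    # Spread the remaining budget proportionally to softmax-weighted relevance,
    # then cap each segment at the dense ceiling. Flooring guarantees the
    # global budget is never exceeded.
    w = np.exp(scores - scores.max())
    w /= w.sum()
    extra = np.floor(spare * w).astype(np.int64)
    return np.minimum(alloc + extra, max_dense)

# Example: four segments, the second strongly query-aligned, 512-token budget.
print(ata_allocate([0.1, 0.9, 0.2, 0.05], budget=512))
```

Because the allocator only sorts bandwidth after scores exist, it adds no extra forward passes; the relevance signal is read off the compressor's single pass, consistent with the $O(1)$ routing claim.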

Paper Structure

This paper contains 59 sections, 5 equations, 6 figures, 4 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Tempo achieves SOTA long video understanding via query-aware Adaptive Token Allocation (ATA). (a) Motivation: Query-agnostic methods either miss transient moments (sparse sampling) or blur details (uniform pooling). Tempo instead utilizes a small vision-language model as a smart compressor for query-aware cross-modal distillation. (b) Mechanism: ATA dynamically allocates high bandwidth (16 tokens/frame) to relevant segments for fine-grained details, while compressing redundant contexts into minimal temporal anchors ($\sim$0.5 tokens/frame) to maintain causality. (c) Result: Leading performance on LVBench. Tempo-6B achieves superior accuracy at extreme compression rates (e.g., 4 or 6 tokens/frame), outperforming open-source models and proprietary baselines with a fraction of the context budget.
  • Figure 2: Overview of the Tempo framework. Our unified architecture casts long video understanding as an end-to-end, query-aware compression process. The Local Compressor (Left). For each segment, a Small Vision-Language Model (SVLM) acts as a semantic temporal compressor. Under causal attention, learnable memory tokens $\mathbf{M}$ inherently distill the preceding visual tokens $\mathbf{X}_i$ and user query $Q$. Inference-Only Bypass (Middle). During a single forward pass, an Adaptive Token Allocation (ATA) controller intercepts the hidden state $\mathbf{h}_i^{\mathrm{rel}}$ to compute a zero-shot relevance score $s_i$. This enables an $\mathcal{O}(1)$ dynamic head truncation, allocating dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to strictly satisfy a global budget $B_{\max}$. The Global Decoder (Right). The compressed memory tokens are assembled into a highly sparse, time-aware sequence using explicit temporal tags (e.g., <t=2.0s>). A global LLM synthesizes this condensed multimodal context to generate the final response (a toy sketch of this time-tagged assembly appears after the figure list).
  • Figure 3: Scaling behavior of Tempo. We investigate the interplay between maximum frame capacity ($f_{\max}$) and total visual token budgets. (Left) On Video-MME (Long), a strict 4K budget acts as an optimal sweet spot by aggressively filtering redundancy, whereas larger budgets (8K/12K) yield only marginal, noisy changes. (Right) On the extremely long LVBench, restrictive budgets eventually cap the achievable performance, whereas expansive capacities (e.g., 12K) monotonically unlock new peaks at higher frame densities, demonstrating the necessity of scaled context for hour-long video understanding.
  • Figure A: Distribution of allocated tokens per segment. (Top) 4K budget ($B=4096$); (Bottom) 8K budget ($B=8192$). Across four benchmarks, ATA consistently exhibits a strongly right-skewed, long-tailed allocation pattern. The majority of segments are compressed into very low-token representations, while a small fraction of query-aligned segments receives substantially higher allocations. Notably, this distribution pattern remains stable under different global budgets.
  • Figure B: Macro-level budget utilization and adaptation efficiency. (Top) 4K budget; (Bottom) 8K budget. Each red dot denotes the average token consumption per segment for a video sample. The dashed green line indicates the dataset-level average theoretical capacity. Points above the line correspond to shorter videos whose per-segment capacity is higher than the dataset-wide average. Adaptability: On datasets with diverse video lengths (e.g., LongVideoBench, Video-MME), the consumption distribution remains well below the theoretical capacity, indicating query-driven compression. Reliability: On extremely long videos (LVBench), the token consumption forms a clear ceiling at the theoretical limit, demonstrating strict adherence to the global budget constraint.
  • ...and 1 more figure
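To illustrate the global decoder's input format described in Figure 2, here is a toy sketch of how compressed segments might be interleaved with explicit temporal tags. Only the <t=...s> tag format comes from the caption; the `Segment` container, the rendering of visual tokens as printable placeholders, and all names are hypothetical (a real system would splice token embeddings into the LLM input, not text).

```python
# Hypothetical sketch of assembling the time-aware decoder input (Figure 2,
# right panel). Everything except the <t=...s> tag format is assumed.
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    start_sec: float          # timestamp of the segment's first frame
    memory_tokens: List[int]  # compressed memory-token ids from the SVLM

def assemble_decoder_input(segments: List[Segment], query: str) -> str:
    """Interleave explicit temporal tags with each segment's compressed tokens."""
    parts = []
    for seg in segments:
        parts.append(f"<t={seg.start_sec:.1f}s>")
        # Placeholder string for illustration only; the real pipeline would
        # insert the segment's memory-token embeddings at this position.
        parts.append(f"[{len(seg.memory_tokens)} visual tokens]")
    parts.append(query)
    return " ".join(parts)

# Example: a sparse anchor segment followed by a dense query-critical one.
segs = [Segment(0.0, list(range(8))), Segment(2.0, list(range(256)))]
print(assemble_decoder_input(segs, "When does the player score?"))
```

The explicit timestamps let the global LLM reason about temporal order and gaps even after most segments have been squeezed down to a handful of anchor tokens.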