State-Space Hierarchical Compression with Gated Attention and Learnable Sampling for Hour-Long Video Understanding in Large Multimodal Models

Geewook Kim, Minjoon Seo

TL;DR

This work tackles the token explosion problem in hour-long video understanding for large multimodal models by introducing MambaMia, a modular, pre-LLM compression framework. It relies on a two-stage hierarchy: a Spatiotemporal Compression Layer with Gated Patch Aggregation (GPA) to produce compact anchor tokens, and a Time Axis Aggregator (TAA) that models temporal dynamics and performs adaptive frame selection via delta-time signals. The approach combines state-space modeling (Mamba) with learned, query-conditioned pooling and delta-based frame filtering, yielding substantial token reductions (for example, about 4.7K tokens on LVBench) while maintaining competitive accuracy across LVBench, MLVU, and other long-video benchmarks. Experimental results include extensive ablations confirming the contributions of GPA and delta-based sampling, as well as strong cross-backbone performance and open-source availability, suggesting practical feasibility for real-world long-video reasoning tasks.

Abstract

We propose an efficient framework to compress massive video-frame features before feeding them into large multimodal models, thereby mitigating the severe token explosion arising from hour-long videos. Our design leverages a bidirectional state-space model equipped with a gated skip connection and a learnable weighted-average pooling mechanism applied to periodically inserted learned queries. This structure enables hierarchical downsampling across both spatial and temporal dimensions, preserving performance in a cost-effective manner. Across challenging hour-long video understanding tasks, our approach demonstrates competitive results against state-of-the-art models, while significantly reducing overall token budget. Notably, replacing our state-space model with conventional modules results in substantial performance degradation, highlighting the advantages of the proposed state-space modeling for effectively compressing multi-frame video information. Our framework emphasizes resource-conscious efficiency, making it practical for real-world deployments. We validate its scalability and generality across multiple benchmarks, achieving the dual objectives of efficient resource usage and comprehensive video understanding.
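The two mechanisms named in the abstract, periodically inserted learned queries and a gated, learnable weighted-average pooling, can be illustrated with a small sketch. This is not the authors' implementation. The function names, the softmax normalization of the pooling scores, and the sigmoid gate that blends the query state with the pooled patches are all assumptions made for illustration. The state-space model is omitted: `query_state`, `scores`, and `gate_logit` stand in for values it would produce.

```python
import math


def insert_periodic_queries(tokens, period, query):
    """Insert a copy of `query` after every `period` tokens.

    Returns the new sequence and the positions of the inserted queries.
    (Hypothetical helper; the source does not specify the period or layout.)
    """
    out, query_positions = [], []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if (i + 1) % period == 0:
            query_positions.append(len(out))
            out.append(list(query))
    return out, query_positions


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gated_pool(query_state, patch_tokens, scores, gate_logit):
    """Blend a query state with a weighted average of its local patches.

    ASSUMPTION: weights = softmax(scores), gate = sigmoid(gate_logit),
    output = gate * query_state + (1 - gate) * pooled.
    """
    weights = softmax(scores)
    dim = len(query_state)
    pooled = [
        sum(w * p[d] for w, p in zip(weights, patch_tokens))
        for d in range(dim)
    ]
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [gate * q + (1.0 - gate) * a for q, a in zip(query_state, pooled)]
```

Under these assumptions, only the pooled query tokens would be passed on, so a period of `period` patches per query shrinks the sequence by roughly that factor. A gate near 1 keeps the query state as is, while a gate near 0 falls back to the plain weighted average of the patches.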

Paper Structure

This paper contains 68 sections, 7 equations, 7 figures, 13 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of the MambaMia framework. Given a long video, we densely sample frames and embed patches to form a large sequence of visual tokens. Our framework then applies two-stage compression: (i) a spatiotemporal compression layer with periodic learnable queries aggregates local features, (ii) a time-axis aggregator uses delta-time values for adaptive frame selection. This pipeline efficiently reduces token count while preserving rich video context for LLM processing.
  • Figure 2: Architecture. (a) Periodic query tokens aggregate local context from nearby tokens using learnable pooling and a gating mechanism. (b) Frame-wise queries are extracted and reorganized into temporal sequences. (c) The time-axis aggregator models temporal dependencies and uses delta-time values for adaptive frame sampling before LLM input.
  • Figure 3: Trade-off analysis between input frames, inference latency, and accuracy. We benchmark five models (Proposed, BIMBA, Self-Attention, Pooling, and Vanilla) over varying input frame counts. (a) Average accuracy across LVBench, MLVU, and VideoMME as a function of frame number; (b) Inference latency (seconds) versus frame number; (c) Average accuracy as a function of inference time cost (symlog scale). The proposed method achieves the best balance, maintaining high accuracy with low test-time compute even as sequence length increases. Notably, (c) shows that our method consistently delivers the highest accuracy under any compute budget, clearly outperforming baselines—especially in high-latency regimes.
  • Figure 4: Comparison of training and inference costs for all baselines. (a) Training runtime (8×A100-hours) at 64-frame input. (b) Peak per-GPU memory usage during inference at 256 frames. All models leverage FlashAttention-2 (Dao, 2023) where possible. Self-Attention without FlashAttention (red hatched bar) causes OOM errors at 256 frames.
  • Figure 5: Visualization of per-frame delta-time values produced by the TAA layer. Peaks correspond to scene transitions or distinctive events (e.g., needle-in-a-video-haystack).
  • ...and 2 more figures
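
The captions for Figures 2(c) and 5 describe adaptive frame sampling driven by per-frame delta-time values, where peaks mark scene transitions or distinctive events. A minimal sketch of one way such a selection could work is shown below. The top-k rule, the function name `select_frames_by_delta`, and the `keep` parameter are assumptions for illustration. The captions do not state the actual selection criterion.

```python
def select_frames_by_delta(deltas, keep):
    """Return indices of the `keep` frames with the largest delta values.

    ASSUMPTION: a simple top-k rule; indices are returned in temporal
    order so the selected frames stay chronologically sorted.
    """
    if keep <= 0:
        return []
    ranked = sorted(range(len(deltas)), key=lambda i: deltas[i], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical delta values for five frames; frames 1 and 4 are the peaks.
example_deltas = [0.1, 0.9, 0.2, 0.05, 0.8]
selected = select_frames_by_delta(example_deltas, keep=2)
```

Returning the indices in temporal order preserves the frame sequence for the downstream LLM. A threshold on the delta value would be an equally plausible alternative to a fixed `keep` count.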