STORM: Token-Efficient Long Video Understanding for Multimodal LLMs

Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon

TL;DR

STORM addresses the inefficiency of long-video understanding in Video-LLMs by inserting a Mamba-based temporal encoder between the image encoder and the LLM, enabling explicit spatiotemporal token enrichment. The core idea is to propagate temporal history into visual tokens and then apply training-time and test-time token compression, dramatically reducing the tokens fed to the LLM while preserving essential dynamics. Empirical results show state-of-the-art performance on several long-video benchmarks with substantial reductions in tokens and decoding latency, including up to 8× computation savings and 2.4–2.9× faster decoding for fixed input frames. The approach also demonstrates robustness across architectures, supports streaming, and scales to longer temporal contexts, highlighting a practical path toward efficient, robust long-video understanding in multimodal systems.
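
The idea of propagating temporal history into visual tokens can be illustrated with a toy causal recurrence. This is only a sketch under strong assumptions. It uses a fixed scalar decay rather than Mamba's input-dependent selective state space, it scans along time only (the paper describes spatiotemporal scanning), and it treats each token as a single float instead of a feature vector. The names `temporal_scan` and `decay` are hypothetical and do not come from the paper.

```python
def temporal_scan(frames, decay=0.5):
    """Toy causal recurrence over time (NOT Mamba; an illustrative stand-in).

    Update rule per token position: h_t = decay * h_{t-1} + (1 - decay) * x_t.
    frames: list of T frames, each a list of N token values (floats).
    Returns T enriched frames; frame t mixes information from frames 0..t.
    """
    state = [0.0] * len(frames[0])
    enriched = []
    for frame in frames:
        # Blend the running state with the current frame's tokens.
        state = [decay * h + (1.0 - decay) * x for h, x in zip(state, frame)]
        enriched.append(list(state))
    return enriched


# Three frames with two tokens each; only the first frame carries a signal.
frames = [[1.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
enriched = temporal_scan(frames)
print(enriched[-1])  # the last frame still holds a decayed trace of frame 0
```

Because later tokens already summarize earlier frames, dropping or averaging frames after such a scan loses less information than doing so on independently encoded frames, which is the intuition behind compressing tokens after the temporal encoder.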

Abstract

Recent advances in video-based multimodal large language models (Video-LLMs) have significantly improved video understanding by processing videos as sequences of image frames. However, many existing methods treat frames independently in the vision backbone, lacking explicit temporal modeling, which limits their ability to capture dynamic patterns and efficiently handle long videos. To address these limitations, we introduce STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. Our temporal encoder leverages the Mamba State Space Model to integrate temporal information into image tokens, generating enriched representations that preserve inter-frame dynamics across the entire video sequence. This enriched encoding not only enhances video reasoning capabilities but also enables effective token reduction strategies, including test-time sampling and training-based temporal and spatial pooling, substantially reducing computational demands on the LLM without sacrificing key temporal information. By integrating these techniques, our approach simultaneously reduces training and inference latency while improving performance, enabling efficient and robust video understanding over extended temporal contexts. Extensive evaluations show that STORM achieves state-of-the-art results across various long video understanding benchmarks (more than 5% improvement on MLVU and LongVideoBench) while reducing computation costs by up to $8\times$ and decoding latency by $2.4$-$2.9\times$ for a fixed number of input frames. The project page is available at https://research.nvidia.com/labs/lpr/storm
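
To make the token-reduction claim concrete, the arithmetic below counts how many visual tokens would reach the LLM under different compression settings. It is a back-of-the-envelope sketch, not the paper's accounting. The function name `llm_visual_tokens`, its parameters, and the figures of 32 frames and 256 tokens per frame are all hypothetical, and the code assumes every factor divides evenly.

```python
def llm_visual_tokens(num_frames, tokens_per_frame,
                      temporal_pool=1, spatial_pool=1, sample_stride=1):
    """Count visual tokens passed to the LLM after optional compression.

    temporal_pool: number of consecutive frames averaged into one frame.
    spatial_pool:  side length of the square pooling window inside a frame.
    sample_stride: keep every sample_stride-th frame at test time.
    All factors are assumed to divide their inputs evenly.
    """
    frames = num_frames // temporal_pool      # temporal average pooling
    frames = frames // sample_stride          # test-time temporal sampling
    per_frame = tokens_per_frame // (spatial_pool ** 2)  # spatial pooling
    return frames * per_frame


# Hypothetical setting: 32 frames, 256 tokens per frame.
baseline = llm_visual_tokens(32, 256)
pooled = llm_visual_tokens(32, 256, temporal_pool=4)
pooled_sampled = llm_visual_tokens(32, 256, temporal_pool=4, sample_stride=2)

print(baseline, pooled, pooled_sampled)
print(baseline // pooled, baseline // pooled_sampled)  # reduction factors
```

With these illustrative numbers the token count falls from 8192 to 2048 and then to 1024, that is, by factors of 4 and 8. The number of input frames seen by the image encoder stays fixed; only the sequence handed to the LLM shrinks.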

Paper Structure

This paper contains 46 sections, 8 equations, 13 figures, 15 tables.

Figures (13)

  • Figure 1: Open-Ended Video Understanding. We show STORM's ability to handle free-form queries about complex long video scenes. By employing the Mamba-based temporal encoder to capture essential spatiotemporal cues while compressing redundant frame information, STORM enables efficient, accurate long-video understanding and outperforms existing methods on a wide range of video understanding tasks.
  • Figure 2: Overview of STORM pipeline. STORM integrates a Mamba-based temporal projector between the image encoder and LLM. This projector performs spatiotemporal scanning to embed temporal information directly into visual tokens. The resulting Summary Tokens encapsulate temporal history, enabling efficient downstream token reduction while preserving essential video dynamics.
  • Figure 3: Token Compression Strategies. This figure illustrates our token compression techniques: spatial average pooling (left), temporal average pooling (middle), and training-free temporal token sampling (right). These methods can be applied individually or in combination, depending on task requirements and computational budget constraints.
  • Figure 4: Model Efficiency and Effectiveness on Long Video Inputs. (left) Profiling results of token compression as the number of frames increases during inference. (middle) Profiling results for 256 input frames with different compression ratios on a single A100. (right) The accuracy of Video-MME (without subtitles) across different numbers of frames during inference. While STORM with test-time temporal sampling showed consistent performance improvements, both VILA and STORM without token compression demonstrated decreased performance beyond 64 frames.
  • Figure 5: Qualitative Examples of STORM + T. Pooling. Our model effectively processes complex video content across various tasks requiring fine-grained temporal and visual understanding while reducing computational overhead through efficient token compression. The example videos can be found at https://research.nvidia.com/labs/lpr/storm.
  • ...and 8 more figures
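
The three compression strategies named in the Figure 3 caption (spatial average pooling, temporal average pooling, and training-free temporal token sampling) can be sketched on a token array as follows. This is a minimal NumPy illustration, not the authors' implementation. The `(T, H, W, C)` layout, the function names, the pooling factors, and the use of uniform stride sampling are assumptions made for the example.

```python
import numpy as np


def spatial_avg_pool(tokens, k):
    """Average non-overlapping k x k windows inside each frame.

    tokens: array of shape (T, H, W, C); H and W must be divisible by k.
    """
    T, H, W, C = tokens.shape
    return tokens.reshape(T, H // k, k, W // k, k, C).mean(axis=(2, 4))


def temporal_avg_pool(tokens, k):
    """Average each group of k consecutive frames; T must be divisible by k."""
    T, H, W, C = tokens.shape
    return tokens.reshape(T // k, k, H, W, C).mean(axis=1)


def temporal_sample(tokens, stride):
    """Keep every stride-th frame (uniform sampling is an assumption here)."""
    return tokens[::stride]


# Hypothetical input: 8 frames, a 4 x 4 token grid, 2 channels per token.
tokens = np.arange(8 * 4 * 4 * 2, dtype=np.float32).reshape(8, 4, 4, 2)

print(spatial_avg_pool(tokens, 2).shape)
print(temporal_avg_pool(tokens, 4).shape)
print(temporal_sample(tokens, 2).shape)

# The strategies compose: pool over time, then sample the pooled frames.
combined = temporal_sample(temporal_avg_pool(tokens, 2), 2)
print(combined.shape)
```

The two pooling variants change what the model sees and are therefore described as training-based, while sampling only drops frames from an already trained model's token sequence, which is why the caption calls it training-free.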