Slow-Fast Architecture for Video Multi-Modal Large Language Models

Min Shi, Shihao Wang, Chieh-Yun Chen, Jitesh Jain, Kai Wang, Junjun Xiong, Guilin Liu, Zhiding Yu, Humphrey Shi

TL;DR

The paper tackles the trade-off between temporal resolution and spatial detail in video MLLMs under a limited compute budget. It introduces a slow-fast architecture: a compact set of compressed "fast" visual tokens enters the LLM alongside the text embeddings as a quick preview, while uncompressed "slow" visual tokens are retained and queried by the text embeddings through cross-attention. Because the slow tokens are reached only through cross-attention, the cost of that extraction grows linearly with their number, which lets the model scale to longer inputs. Against self-attention-only baselines, the reported results extend input capacity from 16 to 128 frames with a 3% increase in computation and improve average performance by 16% across five video understanding benchmarks; the 7B model reaches state-of-the-art performance among models of similar size. The design is plug-and-play and can be integrated into other video MLLMs, and it is reported to improve reasoning, OCR, and information extraction from video input.

Abstract

Balancing temporal resolution and spatial detail under limited compute budget remains a key challenge for video-based multi-modal large language models (MLLMs). Existing methods typically compress video representations using predefined rules before feeding them into the LLM, resulting in irreversible information loss and often ignoring input instructions. To address this, we propose a novel slow-fast architecture that naturally circumvents this trade-off, enabling the use of more input frames while preserving spatial details. Inspired by how humans first skim a video before focusing on relevant parts, our slow-fast design employs a dual-token strategy: 1) "fast" visual tokens -- a compact set of compressed video features -- are fed into the LLM alongside text embeddings to provide a quick overview; 2) "slow" visual tokens -- uncompressed video features -- are cross-attended by text embeddings through specially designed hybrid decoder layers, enabling instruction-aware extraction of relevant visual details with linear complexity. We conduct systematic exploration to optimize both the overall architecture and key components. Experiments show that our model significantly outperforms self-attention-only baselines, extending the input capacity from 16 to 128 frames with just a 3% increase in computation, and achieving a 16% average performance improvement across five video understanding benchmarks. Our 7B model achieves state-of-the-art performance among models of similar size. Furthermore, our slow-fast architecture is a plug-and-play design that can be integrated into other video MLLMs to improve efficiency and scalability.
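
The "linear complexity" claim in the abstract can be illustrated with a rough count of query-key pairs per attention layer. The sketch below is only back-of-the-envelope arithmetic, not the paper's FLOP accounting: the function names and the token counts (64 text tokens, 196 slow and 4 fast tokens per frame) are hypothetical values chosen for illustration, and it ignores that only some decoder layers are hybrid.

```python
def attention_pairs_self_only(num_text, num_visual):
    """Query-key pairs when every visual token sits in the LLM sequence."""
    seq = num_text + num_visual
    return seq * seq


def cross_attention_pairs(num_text, num_slow):
    """Pairs when only text queries attend to slow tokens (linear in num_slow)."""
    return num_text * num_slow


def attention_pairs_slow_fast(num_text, num_fast, num_slow):
    """Self-attention over text + fast tokens, plus text-to-slow cross-attention."""
    seq = num_text + num_fast
    return seq * seq + cross_attention_pairs(num_text, num_slow)


# Hypothetical sizes, chosen only to make the scaling visible.
TEXT_TOKENS = 64
SLOW_PER_FRAME = 196
FAST_PER_FRAME = 4

for frames in (16, 32, 64, 128):
    slow = frames * SLOW_PER_FRAME
    fast = frames * FAST_PER_FRAME
    baseline = attention_pairs_self_only(TEXT_TOKENS, slow)
    slow_fast = attention_pairs_slow_fast(TEXT_TOKENS, fast, slow)
    print(f"{frames:4d} frames  self-only={baseline:>12,d}  slow-fast={slow_fast:>12,d}")
```

In this toy count, doubling the number of slow tokens doubles the cross-attention term, whereas placing the same tokens in the self-attention sequence grows the cost quadratically.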

Paper Structure

This paper contains 15 sections, 2 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Comparison between the mainstream video MLLM architecture and the proposed slow-fast architecture. Rather than relying on carefully-designed video representation compression strategies, the slow-fast architecture utilizes highly compressed "fast" visual tokens as a preview for the LLM while allowing text embeddings to extract relevant information from uncompressed "slow" visual tokens via cross-attention. This approach extends a 16-frame baseline to a 96-frame input with only a 2% increase in computation, yielding a 14% average performance improvement across five benchmarks.
  • Figure 2: Illustration of the Slow-Fast Architecture and Hybrid Decoder. The video input is first processed into slow visual tokens through a vision encoder and projector. These slow visual tokens are then condensed into a smaller set of fast visual tokens via strided sampling and temporal pooling. The fast visual tokens are concatenated with text embeddings and fed into the LLM, serving as a preview context. Meanwhile, the slow visual tokens interact with text embeddings through cross-attention in hybrid decoder layers distributed within the LLM, enabling instruction-aware visual information extraction with linear complexity.
  • Figure 3: Qualitative examples and comparisons between different input frame numbers. For the video on the left, models trained and tested with 64 and 96 frames are compared, denoted as "64x" and "96x". In the video on the right, we further apply test time augmentation by increasing the input frames to 192. More comparisons are available in the supplement.
  • Figure 4: Visualizations of the cross-attention map and the dynamic gate in the hybrid decoder. The cross-attention maps are averaged across decoder layers, text tokens, and attention heads. The absolute values of the dynamic gate from all four hybrid decoder layers are visualized.
  • Figure 5: More qualitative examples of our model.
  • ...and 4 more figures
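
The data flow described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`make_fast_tokens`, `gated_cross_attention`), the choice to stride over the per-frame token axis, the averaging window, the single attention head, and the tanh form of the dynamic gate are all hypothetical, since the captions do not specify them.

```python
import numpy as np


def make_fast_tokens(slow, token_stride=4, temporal_window=2):
    """Condense slow tokens (frames, tokens, dim) into a small fast-token set.

    Assumption: strided sampling keeps every `token_stride`-th token per frame,
    and temporal pooling averages groups of `temporal_window` consecutive frames.
    """
    frames, _, dim = slow.shape
    sampled = slow[:, ::token_stride, :]
    usable = (frames // temporal_window) * temporal_window  # drop leftover frames
    grouped = sampled[:usable].reshape(usable // temporal_window, temporal_window, -1, dim)
    return grouped.mean(axis=1).reshape(-1, dim)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_cross_attention(text, slow_flat, w_q, w_k, w_v, w_gate):
    """Text embeddings query the uncompressed slow tokens (single head).

    Assumption: the dynamic gate is a per-token scalar tanh(text @ w_gate)
    that scales the attended features before the residual addition.
    """
    q = text @ w_q
    k = slow_flat @ w_k
    v = slow_flat @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # shape: (num_text, num_slow)
    attended = softmax(scores, axis=-1) @ v
    gate = np.tanh(text @ w_gate)            # shape: (num_text, 1)
    return text + gate * attended


rng = np.random.default_rng(0)
frames, tokens, dim, num_text = 8, 16, 32, 5
slow = rng.standard_normal((frames, tokens, dim))
text = rng.standard_normal((num_text, dim))
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))
w_gate = rng.standard_normal((dim, 1)) * 0.1

fast = make_fast_tokens(slow)
llm_input = np.concatenate([fast, text], axis=0)  # fast tokens act as the preview context
updated_text = gated_cross_attention(text, slow.reshape(-1, dim), w_q, w_k, w_v, w_gate)
print(fast.shape, llm_input.shape, updated_text.shape)
```

Only the text rows are queries here, so the score matrix has one row per text token and one column per slow token; its size grows linearly with the number of slow tokens, in line with the linear complexity the caption mentions.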