MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention

Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu

TL;DR

MMInference introduces a modality-aware permutation sparse attention framework that accelerates the pre-filling stage of long-context multi-modal VLMs. It identifies a grid-like intra-modality pattern and boundary-specific (Q-Boundary, 2D-Boundary) inter-modality patterns, uses an offline pattern-search algorithm to assign a pattern to each attention head, builds per-head sparse indices at inference time, and runs them with optimized GPU kernels. The approach preserves accuracy across video understanding, retrieval, and mixed-modality tasks while speeding up pre-filling by up to 8.3× at 1M tokens, without model fine-tuning. The lower latency at unchanged accuracy makes long-context VLMs more practical to deploy in real-world applications.

Abstract

The integration of long-context capabilities with visual understanding unlocks unprecedented potential for Vision Language Models (VLMs). However, the quadratic attention complexity during the pre-filling phase remains a significant obstacle to real-world deployment. To overcome this limitation, we introduce MMInference (Multimodality Million tokens Inference), a dynamic sparse attention method that accelerates the pre-filling stage for long-context multi-modal inputs. First, our analysis reveals that the temporal and spatial locality of video input leads to a unique sparse pattern, the Grid pattern. At the same time, VLMs exhibit markedly different sparse distributions across modalities. We introduce a permutation-based method that exploits the Grid pattern and handles modality boundary issues. By searching offline for the optimal sparse pattern of each head, MMInference constructs the sparse distribution dynamically based on the input. We also provide optimized GPU kernels for efficient sparse computation. Notably, MMInference integrates seamlessly into existing VLM pipelines without any model modifications or fine-tuning. Experiments on multi-modal benchmarks (including Video QA, Captioning, VisionNIAH, and Mixed-Modality NIAH) with state-of-the-art long-context VLMs (LongVila, LlavaVideo, VideoChat-Flash, Qwen2.5-VL) show that MMInference accelerates the pre-filling stage by up to 8.3× at 1M tokens while maintaining accuracy. Our code is available at https://aka.ms/MMInference.
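
The permutation idea in the abstract can be illustrated with a toy example. The sketch below is not the authors' kernel or their exact Grid pattern: it assumes a simplified grid in which token i attends to token j only when both share the same position modulo a fixed stride (for video, roughly "the same spatial location in every frame"), and it omits the causal mask for brevity. The names `grid_mask`, `grid_permutation`, and `blockwise_attention` are hypothetical. Under those assumptions, reordering tokens by their residue turns the scattered grid into dense diagonal blocks, and attention computed block by block on the permuted tensors matches masked attention in the original order once the permutation is undone.

```python
import numpy as np

def masked_attention(q, k, v, mask):
    """Softmax attention restricted to the True entries of a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v

def grid_mask(n, stride):
    """Toy grid pattern (assumption): i attends to j iff i % stride == j % stride."""
    idx = np.arange(n)
    return (idx[:, None] % stride) == (idx[None, :] % stride)

def grid_permutation(n, stride):
    """Group tokens by residue modulo stride, keeping original order within a group."""
    return np.argsort(np.arange(n) % stride, kind="stable")

def blockwise_attention(q, k, v, stride):
    """Dense attention on each contiguous block of the permuted sequence."""
    n = q.shape[0]
    block = n // stride  # assumes n is a multiple of stride
    perm = grid_permutation(n, stride)
    qp, kp, vp = q[perm], k[perm], v[perm]
    out_p = np.empty_like(vp)
    dense = np.ones((block, block), dtype=bool)
    for start in range(0, n, block):
        sl = slice(start, start + block)
        out_p[sl] = masked_attention(qp[sl], kp[sl], vp[sl], dense)
    out = np.empty_like(out_p)
    out[perm] = out_p  # undo the permutation on the output rows
    return out

rng = np.random.default_rng(0)
n, d, stride = 12, 4, 3
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

perm = grid_permutation(n, stride)
permuted_mask = grid_mask(n, stride)[np.ix_(perm, perm)]  # block-diagonal after permutation
reference = masked_attention(q, k, v, grid_mask(n, stride))
fast = blockwise_attention(q, k, v, stride)
print(np.allclose(reference, fast))
```

The point of the reordering is that contiguous dense blocks can be loaded and multiplied efficiently, whereas the strided entries of the unpermuted mask cannot. This toy version loops over blocks in NumPy purely to show the equivalence.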

Paper Structure

This paper contains 55 sections, 21 figures, 4 tables, 7 algorithms.

Figures (21)

  • Figure 1: Dynamic sparse attention pipelines leverage sparse loading with dense computation [zheng2023pit] to enable hardware-efficient acceleration. MMInference adopts a bottom-up system-algorithm co-design that accounts for both the mathematical equivalence constraints of sparse loading and the locality properties of real-world attention patterns.
  • Figure 2: (a) Latency breakdown of the pre-filling stage, with 256 tokens per frame. (b) How many attention elements must be computed to achieve 95% recall in a 128k context. (c) Low attention recall when reusing the top-k indices from a different request. Visualizations are based on LongVILA-7B-1M [xue2024longvila] on a single A100.
  • Figure 3: Visualization of pre- vs. post-permutation sparse attention patterns in VLMs.
  • Figure 4: The framework of MMInference, encompassing both inter- and intra-modality sparse attention patterns.
  • Figure 5: V-NIAH [zhang2024long] and MM-NIAH results using LongVila-Qwen2-7B-1M [xue2024longvila].
  • ...and 16 more figures