OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs

Feng Chen, Yefei He, Shaoxuan He, Yuanyu He, Jing Liu, Lequan Lin, Akide Liu, Zhaoyang Li, Jiyuan Zhang, Zhenbang Sun, Bohan Zhuang, Qi Wu

TL;DR

OmniSparse tackles the training-inference gap and inefficiencies of sparse attention in long-video MLLMs by introducing a training-aware, fine-grained sparsity framework. It jointly optimizes across queries, key-values, and heads through query selection, head-aware KV budgeting, and KV cache slimming, applying the same sparsity pattern during training and inference. The method achieves full-attention–level performance while delivering up to 2.7x prefill speedup and 2.4x memory reduction during decoding, outperforming prior training-aware and training-free sparse approaches. This approach enables scalable long-context multimodal reasoning with practical deployment gains in speed and memory for long-video tasks.

Abstract

Existing sparse attention methods primarily target inference-time acceleration by selecting critical tokens under predefined sparsity patterns. However, they often fail to bridge the training-inference gap and lack the capacity for fine-grained token selection across multiple dimensions such as queries, key-values (KV), and heads, leading to suboptimal performance and limited acceleration gains. In this paper, we introduce OmniSparse, a training-aware fine-grained sparse attention framework for long-video MLLMs, which operates in both training and inference with dynamic token budget allocation. Specifically, OmniSparse contains three adaptive and complementary mechanisms: (1) query selection via lazy-active classification, retaining active queries that capture broad semantic similarity while discarding most lazy ones that focus on limited local context and exhibit high functional redundancy; (2) KV selection with head-level dynamic budget allocation, where a shared budget is determined based on the flattest head and applied uniformly across all heads to ensure attention recall; and (3) KV cache slimming to reduce head-level redundancy by selectively fetching visual KV cache according to the head-level decoding query pattern. Experimental results show that OmniSparse matches the performance of full attention while achieving up to 2.7x speedup during prefill and 2.4x memory reduction during decoding.
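As a rough illustration of the lazy-active idea in mechanism (1), the sketch below labels a query "lazy" when a handful of keys already covers most of its attention mass, and "active" otherwise. The abstract does not state the actual classification rule, so the coverage criterion, the function names, and the parameters `mass` and `max_keys` are all assumptions made for this example, not the authors' method.

```python
from typing import List


def keys_to_cover(weights: List[float], mass: float) -> int:
    """Number of top-weighted keys needed for the attention mass to reach `mass`."""
    total = 0.0
    for count, w in enumerate(sorted(weights, reverse=True), start=1):
        total += w
        if total >= mass - 1e-12:  # small tolerance for float rounding
            return count
    return len(weights)


def classify_queries(attn_rows: List[List[float]],
                     mass: float = 0.9,
                     max_keys: int = 2) -> List[str]:
    """Hypothetical rule: a query is 'lazy' if at most `max_keys` keys cover `mass`."""
    return [
        "lazy" if keys_to_cover(row, mass) <= max_keys else "active"
        for row in attn_rows
    ]


# Toy attention rows for three queries over five keys (each row sums to 1).
rows = [
    [0.85, 0.10, 0.03, 0.01, 0.01],  # concentrated on two keys
    [0.25, 0.20, 0.20, 0.20, 0.15],  # spread over all keys
    [0.02, 0.03, 0.90, 0.03, 0.02],  # concentrated on one key
]
labels = classify_queries(rows)
active_idx = [i for i, label in enumerate(labels) if label == "active"]
```

In this toy case only the second query is kept as active. The abstract says most, not all, lazy queries are discarded; how the retained lazy queries are chosen is not covered by this sketch.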

Paper Structure

This paper contains 12 sections, 6 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: (a) The majority of queries focus on fewer than 100 out of 2,600 tokens and can be pruned from attention with minimal performance degradation. (b) The dynamic sparsity across layers and heads suggests that head-wise budget allocation could improve efficiency, but determining an optimal budget for each head is computationally expensive. (c) Heterogeneous token focus across heads (selected keys are highlighted in red) necessitates head-level KV selection. Data collected from LLaVA-Video-7b [llavavideo] on VideoMME [videomme].
  • Figure 2: Overview of OmniSparse. During prefill, head-level queries are selected by probing query patterns, with a threshold $\tau$ used to filter out redundant queries. KV selection keeps the top $b$ most salient KV pairs for each head, with the shared budget $b$ determined by the flattest head so that attention recall exceeds the retention ratio $p$ for all heads. During decoding, only relevant visual KV pairs are fetched for active decoding queries.
  • Figure 3: Query redundancy: Queries from different heads (left), spatially adjacent positions (middle), and temporally adjacent positions (right) focus on similar tokens.
  • Figure 4: Prefill latency (left) and decoding memory usage (right) under varying sequence lengths on LLaVA-Video-7b.
  • Figure 5: Layer-wise sparsity difference between flattest and sharpest heads on LLaVA-Video-7b.
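
The budget rule in the Figure 2 caption can be made concrete with a small sketch: for each head, find the smallest number of top-weighted keys whose attention mass reaches the retention ratio $p$, take the largest such count (set by the flattest head) as the shared budget $b$, and keep the top $b$ keys in every head. This is a minimal pure-Python illustration under assumptions: the per-head weights are taken as given (how they are estimated from probing queries is not described here), and all function names are hypothetical.

```python
from typing import List


def min_budget_for_recall(weights: List[float], p: float) -> int:
    """Smallest number of top-weighted keys whose attention mass reaches p."""
    total = 0.0
    for count, w in enumerate(sorted(weights, reverse=True), start=1):
        total += w
        if total >= p - 1e-12:  # small tolerance for float rounding
            return count
    return len(weights)


def shared_budget(head_weights: List[List[float]], p: float) -> int:
    """Budget b set by the flattest head, i.e. the head needing the most keys."""
    return max(min_budget_for_recall(w, p) for w in head_weights)


def select_top_keys(weights: List[float], b: int) -> List[int]:
    """Indices of the b highest-weighted keys, returned in positional order."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:b])


# Toy per-head attention weights over five keys (each row sums to 1).
heads = [
    [0.90, 0.04, 0.03, 0.02, 0.01],  # sharp head: one key dominates
    [0.10, 0.30, 0.15, 0.25, 0.20],  # flat head: mass is spread out
]
b = shared_budget(heads, p=0.7)
selected = [select_top_keys(w, b) for w in heads]
recalls = [sum(w[i] for i in idx) for w, idx in zip(heads, selected)]
```

With $p = 0.7$, the sharp head alone would need one key, while the flat head needs three, so $b = 3$ and every head keeps three keys. Because the budget comes from the head that needs the most keys, each head's recall is at least $p$ by construction.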