OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs
Feng Chen, Yefei He, Shaoxuan He, Yuanyu He, Jing Liu, Lequan Lin, Akide Liu, Zhaoyang Li, Jiyuan Zhang, Zhenbang Sun, Bohan Zhuang, Qi Wu
TL;DR
OmniSparse addresses the training-inference gap and the limited efficiency of sparse attention in long-video MLLMs with a training-aware, fine-grained sparsity framework. It sparsifies attention jointly across queries, key-values (KV), and heads through query selection, head-aware KV budgeting, and KV cache slimming, and it applies the same sparsity pattern during training and inference. The method matches full-attention performance while delivering up to 2.7x prefill speedup and 2.4x memory reduction during decoding, outperforming prior training-aware and training-free sparse approaches. These speed and memory gains make long-context multimodal reasoning over long videos more practical to deploy.
Abstract
Existing sparse attention methods primarily target inference-time acceleration by selecting critical tokens under predefined sparsity patterns. However, they often fail to bridge the training-inference gap and lack the capacity for fine-grained token selection across multiple dimensions such as queries, key-values (KV), and heads, leading to suboptimal performance and limited acceleration gains. In this paper, we introduce OmniSparse, a training-aware fine-grained sparse attention framework for long-video MLLMs, which operates in both training and inference with dynamic token budget allocation. Specifically, OmniSparse contains three adaptive and complementary mechanisms: (1) query selection via lazy-active classification, retaining active queries that capture broad semantic similarity while discarding most lazy ones that focus on limited local context and exhibit high functional redundancy; (2) KV selection with head-level dynamic budget allocation, where a shared budget is determined based on the flattest head and applied uniformly across all heads to ensure attention recall; and (3) KV cache slimming to reduce head-level redundancy by selectively fetching visual KV cache according to the head-level decoding query pattern. Experimental results show that OmniSparse matches the performance of full attention while achieving up to 2.7x speedup during prefill and 2.4x memory reduction during decoding.
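To make mechanism (2) more concrete, the following is a minimal NumPy sketch of one way a shared KV budget could be derived from the flattest head. It is an illustration under stated assumptions, not the authors' implementation: the function names `shared_kv_budget` and `select_topk_keys`, the `recall` threshold, and the use of cumulative attention mass as the flatness criterion are all hypothetical choices made for this example.

```python
import numpy as np


def shared_kv_budget(attn, recall=0.9):
    """Pick one KV budget for all heads from the flattest head.

    attn: array of shape (num_heads, num_keys); each row holds one head's
          attention probabilities and sums to 1 (assumed input format).
    recall: target fraction of attention mass to retain (assumed knob).
    Returns (budget, per_head_k).
    """
    # Sort each head's probabilities from largest to smallest.
    sorted_probs = np.sort(attn, axis=-1)[:, ::-1]
    cum_mass = np.cumsum(sorted_probs, axis=-1)
    # Smallest k per head whose top-k mass reaches the recall target.
    per_head_k = (cum_mass < recall).sum(axis=-1) + 1
    per_head_k = np.minimum(per_head_k, attn.shape[-1])
    # The flattest head needs the most keys; share its budget with all heads.
    budget = int(per_head_k.max())
    return budget, per_head_k


def select_topk_keys(attn, budget):
    """Return, per head, the sorted indices of its `budget` highest-scoring keys."""
    idx = np.argsort(-attn, axis=-1, kind="stable")[:, :budget]
    return np.sort(idx, axis=-1)


if __name__ == "__main__":
    attn = np.array([
        [0.70, 0.20, 0.05, 0.05],  # peaked head: few keys carry most mass
        [0.25, 0.25, 0.25, 0.25],  # flat head: mass spread over all keys
    ])
    budget, per_head_k = shared_kv_budget(attn, recall=0.85)
    print("per-head k:", per_head_k.tolist(), "shared budget:", budget)
    print("kept keys:", select_topk_keys(attn, budget).tolist())
```

In this sketch the peaked head alone would need only a few keys, but the uniform head needs all of them to reach the recall target, so the shared budget follows the flat head. This mirrors the abstract's stated rationale of sizing the budget on the flattest head to ensure attention recall across all heads.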
