VideoNSA: Native Sparse Attention Scales Video Understanding
Enxin Song, Wenhao Chai, Shusheng Yang, Ethan Armand, Xiaojun Shan, Haiyang Xu, Jianwen Xie, Zhuowen Tu
TL;DR
VideoNSA introduces a hardware-aware native sparse attention framework that scales video-language understanding to ultra-long contexts (up to 128K tokens). It combines a three-branch sparse attention mechanism (compression, selection, sliding window) for video with grouped-query attention for text, and is trained end-to-end on a large video-instruction dataset. The approach achieves competitive results across long-video, temporal reasoning, and spatial benchmarks while remaining efficient. Key contributions include a dynamic gating scheme, insights into attention budget allocation, and findings on attention sinks and scalability. In practice, the framework enables long-range video reasoning with reduced compute.
Abstract
Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we bring Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We use a hardware-aware hybrid attention design that preserves dense attention for text and applies NSA to video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention helps induce dynamic attention sinks.
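To make the three-branch idea concrete, the following is a minimal, single-query sketch in plain Python, not the VideoNSA implementation. Several details are assumptions made for illustration: blocks are compressed by mean pooling (a learned compressor would normally be used), blocks are selected by their compressed-key score, and the gate values are passed in as fixed numbers rather than predicted from the query. The function names and the `block`, `top_k`, and `window` parameters are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(q, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def mean_pool(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sparse_attention(q, keys, values, gates, block=4, top_k=2, window=4):
    """Gated sum of compression, selection, and sliding-window branches."""
    blocks = [range(s, min(s + block, len(keys)))
              for s in range(0, len(keys), block)]

    # Compression branch: attend over one pooled token per block (assumed mean pooling).
    ck = [mean_pool([keys[i] for i in b]) for b in blocks]
    cv = [mean_pool([values[i] for i in b]) for b in blocks]
    out_cmp = attend(q, ck, cv)

    # Selection branch: keep the top_k blocks by compressed-key score,
    # then attend over the uncompressed tokens inside them.
    scores = [dot(q, k) for k in ck]
    chosen = sorted(range(len(blocks)), key=lambda j: scores[j], reverse=True)[:top_k]
    idx = sorted(i for j in chosen for i in blocks[j])
    out_sel = attend(q, [keys[i] for i in idx], [values[i] for i in idx])

    # Sliding-window branch: attend over the most recent tokens only.
    out_win = attend(q, keys[-window:], values[-window:])

    # Gates are assumed to be given; in a trained model they would be learned.
    g_cmp, g_sel, g_win = gates
    return [g_cmp * a + g_sel * b + g_win * c
            for a, b, c in zip(out_cmp, out_sel, out_win)]
```

With 8 tokens, `block=4`, `top_k=1`, and `window=4`, the query touches 2 compressed tokens, 4 selected tokens, and 4 window tokens instead of all 8 at full resolution in every branch; the savings grow with sequence length because the per-branch budgets stay fixed.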
