VideoNSA: Native Sparse Attention Scales Video Understanding

Enxin Song, Wenhao Chai, Shusheng Yang, Ethan Armand, Xiaojun Shan, Haiyang Xu, Jianwen Xie, Zhuowen Tu

TL;DR

VideoNSA introduces a hardware-aware native sparse attention framework to scale video-language understanding to ultra-long contexts (up to 128K tokens). By combining a three-branch sparse attention mechanism (compression, selection, sliding window) with grouped-query attention for text, and training end-to-end on a large video-instruction dataset, the approach achieves competitive results across long-video, temporal reasoning, and spatial benchmarks while maintaining efficiency. Key contributions include a dynamic gating scheme, insights into attention budget allocation, and findings on attention sinks and scalability. The framework demonstrates practical impact by enabling long-range video reasoning with reduced compute, paving the way for more capable video foundation models.

Abstract

Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We employ a hardware-aware hybrid approach to attention, preserving dense attention for text while employing NSA for video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention helps induce dynamic attention sinks.

Paper Structure

This paper contains 33 sections, 10 equations, 29 figures, 29 tables.

Figures (29)

  • Figure 1: Overview of VideoNSA. Video frames are encoded into frame-level KV blocks. VideoNSA uses three sparse attention branches during the prefilling stage: the compression branch reduces redundancy via token averaging, the selection branch identifies the top-k important tokens, and the sliding window branch enforces local temporal coverage. The outputs are combined through dynamic gating before integration with text tokens for LLM decoding.
  • Figure 2: Scaling Performance of VideoNSA under Different Context Allocation Strategies. We highlight the Token Budget Constraint to indicate settings with equal context length, and annotate the best-performing configuration under each benchmark. For Tomato (shangguan2024tomato), we vary FPS instead of total frames, with FPS $\times$ TPF = 128 denoted as $K_0$.
  • Figure 3: Scaling Performance of VideoNSA under Different Attention Allocation Strategies. Scatter points from small to large and from light to dark indicate increasing performance. We annotate the point corresponding to the same attention allocation strategy as used during training and connect configurations with equal attention budgets using solid orange lines. We further scale the best configuration using dashed lines. Percentages show attention relative to full attention.
  • Figure 4: Gate weights across layers in VideoNSA. Compression remains dominant, while selection and sliding-window weaken in later layers.
  • Figure 5: Inter-head similarities of gates in VideoNSA. Selection and sliding-window gates show high similarity in middle layers.
  • ...and 24 more figures
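
The three-branch design described in the Figure 1 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: it handles a single query and a single head, uses mean pooling for compression (following the caption's "token averaging"), selects whole blocks rather than individual tokens, and takes fixed gate logits instead of learning them from the input. The function name `three_branch_attention` and the parameters `block`, `top_k`, `window`, and `gate_logits` are hypothetical names chosen for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def attend(q, K, V):
    """Scaled dot-product attention for one query vector q over keys K, values V."""
    weights = softmax(K @ q / np.sqrt(q.shape[0]))
    return weights @ V


def three_branch_attention(q, K, V, gate_logits, block=4, top_k=2, window=8):
    """Illustrative sketch only (assumed shapes: q is (d,), K and V are (n, d)).

    Returns the gated output and the indices of the selected blocks.
    """
    n, d = K.shape
    n_blocks = n // block  # tokens past the last full block are skipped here
    Kb = K[: n_blocks * block].reshape(n_blocks, block, d)
    Vb = V[: n_blocks * block].reshape(n_blocks, block, d)

    # Compression branch: one averaged key/value per block.
    Kc, Vc = Kb.mean(axis=1), Vb.mean(axis=1)
    out_cmp = attend(q, Kc, Vc)

    # Selection branch: rank blocks by the compressed-key score,
    # keep the top_k blocks, and attend over their original tokens.
    block_scores = Kc @ q
    keep = np.sort(np.argsort(block_scores)[-top_k:])
    out_sel = attend(q, Kb[keep].reshape(-1, d), Vb[keep].reshape(-1, d))

    # Sliding-window branch: attend over the most recent `window` tokens.
    out_win = attend(q, K[-window:], V[-window:])

    # Gating: one sigmoid gate per branch (fixed here, learned in practice).
    gates = 1.0 / (1.0 + np.exp(-np.asarray(gate_logits, dtype=float)))
    out = gates[0] * out_cmp + gates[1] * out_sel + gates[2] * out_win
    return out, keep


# Example call with random data (assumed sizes: 32 tokens, dimension 8).
rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(32, 8))
V = rng.normal(size=(32, 8))
out, keep = three_branch_attention(q, K, V, gate_logits=[0.0, 0.0, 0.0])
```

With a fixed budget, `block`, `top_k`, and `window` trade global coverage (compression and selection) against local coverage (sliding window), which is the kind of allocation the scaling figures vary.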