
VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference

Anmin Liu, Ruixuan Yang, Huiqiang Jiang, Bin Lin, Minmin Sun, Yong Li, Chen Zhang, Tao Xie

Abstract

Long-context video understanding and generation pose a significant computational challenge for Transformer-based video models due to the quadratic complexity of self-attention. While existing sparse attention methods employ coarse-grained patterns to improve efficiency, they typically incur redundant computation and suboptimal performance. To address this issue, we propose VecAttention, a novel vector-wise sparse attention framework that achieves superior accuracy-efficiency trade-offs for video models. We observe that video attention maps exhibit a strong vertical-vector sparse pattern, and further demonstrate that this pattern offers consistently better accuracy-sparsity trade-offs than existing coarse-grained sparse patterns. Based on this observation, VecAttention dynamically selects and processes only informative vertical vectors, using a lightweight important-vector selection step that minimizes memory access overhead together with an optimized vector sparse attention kernel. Comprehensive evaluations on video understanding (VideoMME, LongVideoBench, and VCRBench) and generation (VBench) tasks show that VecAttention delivers a 2.65× speedup over full attention and a 1.83× speedup over state-of-the-art sparse attention methods, with accuracy comparable to full attention. Our code is available at https://github.com/anminliu/VecAttention.
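The vertical-vector pattern corresponds to keeping whole columns of the attention map, i.e., key/value vectors that many queries attend to. The minimal PyTorch sketch below illustrates this idea under stated assumptions: per-key importance is estimated from a small probe of recent queries, and attention is then restricted to the top-scoring keys. The probe-based estimator, the keep_ratio and probe parameters, and the function name are illustrative assumptions; this does not reproduce the paper's actual TilingSelect/minS selection or its fused kernel, and causal masking is omitted for brevity.

import torch

def vertical_sparse_attention(q, k, v, keep_ratio=0.3, probe=64):
    """Illustrative sketch of vector-wise (column/key-wise) sparse attention.

    q, k, v: [batch, heads, seq, dim]. Key-column importance is estimated
    from the last `probe` queries only, then dense attention is computed
    over the selected keys. NOT VecAttention's actual selection or kernel;
    causal masking is omitted.
    """
    b, h, n, d = q.shape
    scale = d ** -0.5
    # Estimate per-key importance from a small probe of recent queries.
    probe_scores = torch.softmax(q[:, :, -probe:] @ k.transpose(-1, -2) * scale, dim=-1)
    col_importance = probe_scores.sum(dim=-2)           # [b, h, n]
    k_keep = max(1, int(n * keep_ratio))
    idx = col_importance.topk(k_keep, dim=-1).indices   # [b, h, k_keep]
    # Gather the selected key/value vectors (the "vertical vectors").
    idx_exp = idx.unsqueeze(-1).expand(-1, -1, -1, d)
    k_sel = k.gather(2, idx_exp)
    v_sel = v.gather(2, idx_exp)
    # Dense attention restricted to the selected columns.
    attn = torch.softmax(q @ k_sel.transpose(-1, -2) * scale, dim=-1)
    return attn @ v_sel

In a real kernel the gather and the restricted attention would be fused to avoid materializing k_sel/v_sel; the sketch keeps them separate for readability.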

Figures (15)

  • Figure 1: (a–b) Under the same sparsity budget, the vertical-vector (V-Vec) pattern achieves higher recall and better approximates the oracle pattern than the horizontal-vector (H-Vec) and other coarse line/block patterns, for both VLMs (video understanding) and DiTs (video generation). Markers such as (68.9%, 0.95) denote (sparsity, recall), with sparsity given as a percentage. (c) On VideoMME at matched full-accuracy settings, VecAttention attains higher effective sparsity and faster attention computation, with low important-region selection overhead, compared to existing coarse-grained methods.
  • Figure 2: Attention maps visualized on different layers across tasks of video understanding and video generation.
  • Figure 3: Sparsity-recall trade-off of different sparse patterns across long-context video understanding (left three columns) and video generation (right column) tasks. The top row shows results on the causal InternVL-3.5-8B VLM and the non-causal HunyuanVideo DiT, while the bottom row presents the causal Qwen2.5-VL-7B VLM and the non-causal Wan2.1-T2V-14B DiT. The vertical-vector pattern consistently yields higher recall at the same sparsity than the horizontal-vector and coarse line/block patterns, closely tracking the oracle pattern across datasets and models (a hedged sketch of this sparsity-recall computation follows the figure list).
  • Figure 4: The overview of VecAttention, which can be divided into two stages: (1) Important-Vector Selection via TilingSelect and minS Filter; (2) Vector Sparse Attention on selected vectors.
  • Figure 5: Recall heatmaps of different filter strategies at the same sparsity. The minS filter strategy achieves recall comparable to the alternative filter strategies at the same sparsity level. Markers such as (75.0%, 0.96) denote (sparsity, recall), with sparsity given as a percentage.
  • ...and 10 more figures
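Figures 1, 3, and 5 report (sparsity, recall) pairs for candidate sparse patterns. A standard way to compute these, assumed here since the paper's exact definition is not shown in this excerpt, is: recall is the fraction of total attention probability mass covered by the kept entries, and sparsity is the fraction of entries dropped. The helper name mask_recall_and_sparsity and the example mask construction are hypothetical.

import torch

def mask_recall_and_sparsity(attn_probs, mask):
    """Assumed sparsity-recall metric behind Figures 1/3/5.

    attn_probs: full softmax attention map [heads, n_q, n_k].
    mask: boolean map of entries a sparse pattern keeps, same shape.
    Recall = kept attention mass / total mass; sparsity = fraction dropped.
    The paper's exact definitions may differ.
    """
    recall = (attn_probs * mask).sum() / attn_probs.sum()
    sparsity = 1.0 - mask.float().mean()
    return recall.item(), sparsity.item()

# Example: a vertical-vector mask keeping the top 30% of key columns.
h, nq, nk = 1, 512, 512
probs = torch.softmax(torch.randn(h, nq, nk), dim=-1)
cols = probs.sum(dim=-2).topk(int(0.3 * nk), dim=-1).indices   # [h, k]
mask = torch.zeros(h, nq, nk, dtype=torch.bool)
mask.scatter_(-1, cols.unsqueeze(1).expand(-1, nq, -1), True)
print(mask_recall_and_sparsity(probs, mask))

Under this formulation, a pattern that keeps 30% of entries but captures, say, 95% of the mass would plot as (70.0%, 0.95), matching the marker convention used in the captions.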