VSPrefill: Vertical-Slash Sparse Attention with Lightweight Indexing for Long-Context Prefilling

Chen Guanzhong

TL;DR

This work proposes VSPrefill, a mechanism requiring lightweight training that uses the vertical-slash structural pattern in attention distributions, and establishes a new Pareto frontier in the trade-off between accuracy and efficiency.

Abstract

The quadratic complexity of self-attention during the prefill phase impedes long-context inference in large language models. Existing sparse attention methods face a trade-off among context adaptivity, sampling overhead, and fine-tuning costs. We propose VSPrefill, a mechanism requiring lightweight training that uses the vertical-slash structural pattern in attention distributions. Our compact VSIndexer module predicts context-aware importance scores for vertical columns and slash diagonals from key-value representations augmented with RoPE. This approach constructs sparse masks with linear complexity without modifying the backbone parameters. During inference, an adaptive cumulative-threshold strategy allocates sparsity budgets per layer, while a fused kernel executes attention with on-the-fly index merging. Evaluated on Qwen3-4B-Instruct and LLaMA-3.1-8B-Instruct across the LongBench and RULER benchmarks, VSPrefill preserves 98.35% of the full attention accuracy while delivering a 4.95x average speedup at a context length of 128k. These results establish a new Pareto frontier in the trade-off between accuracy and efficiency.
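
The cumulative-threshold selection and mask construction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function names, the threshold parameter `tau`, and the dense boolean mask are assumptions made for readability. The paper reports linear-complexity mask construction and a fused kernel, neither of which this sketch attempts to reproduce.

```python
def select_by_cumulative_threshold(scores, tau):
    """Return the smallest set of indices whose normalized scores sum to >= tau.

    `scores` stands in for predicted importance scores (vertical or slash);
    `tau` is a hypothetical cumulative-mass threshold in (0, 1].
    """
    total = sum(scores)
    if total <= 0:
        return []
    # Visit indices from most to least important.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    picked, mass = [], 0.0
    for i in order:
        picked.append(i)
        mass += scores[i] / total
        if mass >= tau:
            break
    return sorted(picked)


def vertical_slash_mask(n, columns, offsets):
    """Build a dense n x n causal mask that keeps the selected key columns
    and the selected diagonals, where a diagonal is identified by i - j
    (0 is the main diagonal). Dense only for illustration."""
    cols, offs = set(columns), set(offsets)
    return [
        [(j <= i) and (j in cols or (i - j) in offs) for j in range(n)]
        for i in range(n)
    ]


# Toy scores: the budget adapts to how concentrated the scores are.
columns = select_by_cumulative_threshold([5, 1, 1, 3], tau=0.75)
offsets = select_by_cumulative_threshold([6, 1, 1, 2], tau=0.70)
mask = vertical_slash_mask(4, columns, offsets)
kept = sum(sum(row) for row in mask)
```

Because the number of selected indices depends on how quickly the sorted scores reach `tau`, a layer with concentrated scores receives a smaller budget than a layer with diffuse scores, which is the intuition behind allocating sparsity per layer.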

Paper Structure

This paper contains 41 sections, 14 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: Overview of VSPrefill. The VSIndexer employs a shared-weight bilayer linear network that accepts concatenated key-value pairs as input and outputs vertical and slash attention scores, denoted as $\hat{A}_v$ and $\hat{A}_s$. These scores are trained to approximate the ground-truth full attention weights aggregated along the corresponding directional patterns. During inference, $\hat{A}_v$ and $\hat{A}_s$ undergo top-k selection with a dynamic sparsity budget to construct the vertical-slash sparse attention mask. Implementation details, including the architecture of VSIndexer, RoPE, and sparse attention computation, are omitted for clarity.
  • Figure 2: Accuracy and perplexity trends across different attention recall levels on the HotpotQA dataset.
  • Figure 3: Visualization of dynamic attention sparsity patterns. (a-c) Comparisons reveal high structural similarity within the same KV group (Intra-Group) contrasted with distinct topologies across different groups (Inter-Group). (d) Sparsity patterns evolve significantly as network depth increases. (e-f) The distribution of salient weights shifts in response to varying input prompts and model architectures.
  • Figure 4: Diagonal-aggregated attention heatmap for Layer 0 of Qwen3-4B-Instruct on LongBench. The x-axis denotes the relative diagonal offset ($i-j$) where 0 corresponds to the main diagonal. The emergence of distinct vertical bands at distal offsets validates the existence of consistent slash patterns, characterized by strong correlations at fixed relative distances across multiple heads.
  • Figure 5: Accuracy vs. speedup across different attention mechanisms at context lengths ranging from 32k to 128k.
  • ...and 3 more figures
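
The captions for Figures 1 and 4 refer to full attention weights aggregated along vertical and slash directions. A minimal sketch of one such aggregation is given below. The function name is hypothetical, and summation without normalization is an assumption, since the captions do not state whether the paper sums, averages, or otherwise normalizes the aggregated weights.

```python
def aggregate_vertical_slash(attn):
    """Aggregate a causal attention matrix along two directions.

    attn[i][j] is the weight query i places on key j (with j <= i).
    Returns (vertical, slash), where vertical[j] sums column j and
    slash[d] sums the diagonal at relative offset d = i - j.
    Plain summation is an assumption of this sketch.
    """
    n = len(attn)
    vertical = [0.0] * n  # indexed by key position j
    slash = [0.0] * n     # indexed by relative offset i - j
    for i in range(n):
        for j in range(i + 1):  # causal: only keys at or before the query
            w = attn[i][j]
            vertical[j] += w
            slash[i - j] += w
    return vertical, slash


# Toy causal attention matrix; each row sums to 1.
attn = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.25, 0.25, 0.5],
]
vertical, slash = aggregate_vertical_slash(attn)
```

In this toy example, `vertical` comes out as `[1.75, 0.75, 0.5]` and `slash` as `[2.0, 0.75, 0.25]`. Both vectors sum to the number of query rows because every causal entry contributes to exactly one column and one diagonal.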