TriangleMix: Accelerating Prefilling via Decoding-time Contribution Sparsity

Zhiyuan He, Yike Zhang, Chengruidong Zhang, Huiqiang Jiang, Yuqing Yang, Lili Qiu

TL;DR

This work addresses the bottleneck of quadratic attention in the prefilling stage of long-context LLMs by identifying decoding-time contribution sparsity, a form of sparsity in which some attention blocks with nontrivial scores contribute little to decoding. It introduces TriangleMix, a training-free static attention pattern that assigns Triangle attention to a subset of layers while keeping the others dense, reducing complexity from $O(N^2)$ to $O(N)$ in the Triangle layers. A gradient-based probing framework quantifies block importance and guides layer selection, yielding near-lossless accuracy across three long-context models and benchmarks. Empirically, TriangleMix achieves up to 15.3x attention speedups at 128K inputs and 12%–32% TTFT reductions, with additional gains (6%–19% TTFT) when combined with dynamic sparsity methods, making long-context inference substantially faster with minimal accuracy trade-offs.

Abstract

Large Language Models (LLMs) incur quadratic attention complexity with input length, creating a major time bottleneck in the prefilling stage. Existing acceleration methods largely exploit attention score sparsity by estimating blocks with high attention scores and applying dynamic sparse attention. In this work, we identify another untapped form of sparsity in the prefilling stage, namely decoding-time contribution sparsity, where many attention blocks exhibit nontrivial attention scores during prefilling yet contribute negligibly to subsequent decoding, as indicated by gradient-based analysis. Building on this observation, we propose TriangleMix, a training-free static attention pattern that uses dense attention in a subset of layers and switches to Triangle attention in the others. Extensive experiments show that TriangleMix preserves nearly lossless performance relative to dense attention while substantially reducing attention overhead in Triangle layers. For 128K inputs, Triangle attention achieves a 15.3x speedup in attention computation, significantly exceeding the acceleration of typical dynamic sparse methods (1.9x to 3.4x). Furthermore, TriangleMix can be seamlessly combined with dynamic sparsity approaches, delivering an additional 6% to 19% reduction in TTFT over using dynamic sparsity alone.
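
The abstract does not spell out the Triangle pattern itself, so the following is only a minimal sketch under stated assumptions: it assumes a Triangle layer keeps a few initial "sink" keys, a sliding local window, and full rows for the final queries (the Last Q-K section), while skipping the Middle Q-K section named in the figure captions. The function names and the `sink`, `window`, and `last` parameters are hypothetical, chosen for illustration. The sketch builds the mask with NumPy and counts the retained entries, which should grow roughly linearly in the sequence length rather than quadratically.

```python
import numpy as np


def triangle_mask(n, sink=4, window=8, last=8):
    """Boolean (n, n) mask: True where query i may attend to key j.

    Assumed layout (not taken verbatim from the paper):
      - causal masking throughout,
      - every query sees the first `sink` keys,
      - every query sees its `window` most recent keys,
      - the final `last` queries see all earlier keys.
    Everything else (the "Middle Q-K" region) is skipped.
    """
    q = np.arange(n)[:, None]  # query positions as a column
    k = np.arange(n)[None, :]  # key positions as a row
    causal = k <= q
    keep_sink = k < sink
    keep_local = (q - k) < window
    keep_last = q >= n - last
    return causal & (keep_sink | keep_local | keep_last)


def kept_entries(n, **kwargs):
    """Number of query-key pairs a Triangle layer would compute."""
    return int(triangle_mask(n, **kwargs).sum())


def dense_entries(n):
    """Number of query-key pairs under dense causal attention."""
    return n * (n + 1) // 2


if __name__ == "__main__":
    for n in (256, 512, 1024):
        print(n, kept_entries(n), dense_entries(n))
```

With fixed `sink`, `window`, and `last`, doubling `n` roughly doubles the count for the sketched mask while quadrupling the dense count, which is the intuition behind the $O(N^2)$ to $O(N)$ claim for Triangle layers. A practical kernel would skip the masked blocks instead of materializing an $N \times N$ mask.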

Paper Structure

This paper contains 17 sections, 7 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: The average gradient $\mathrm{Grad}(\bm{M}, l)$ of the Middle Q-K sections, measured on three models, shows significant sparsity in deeper layers. This suggests that the Middle Q-K components in deeper layers contribute minimally to decoding and could be skipped to improve efficiency.
  • Figure 2: Left: Attention computation in certain layers exhibits contribution locality. Right: The proposed TriangleMix pattern for Llama-3.1-8B-Instruct.
  • Figure 3: First row: average attention score for the Middle and Last Q-K sections. Second row: average gradient $\mathrm{Grad}(\bm{M}, l)$ for the Middle and Last Q-K sections.
  • Figure 4: Average RULER score at 64K length for different $L_{\mathrm{tri}}$ values.
  • Figure : Fused Triangle Attention