SLA2: Sparse-Linear Attention with Learnable Routing and QAT

Jintao Zhang; Haoxu Wang; Kai Jiang; Kaiwen Zheng; Youhe Jiang; Ion Stoica; Jianfei Chen; Jun Zhu; Joseph E. Gonzalez

TL;DR

SLA2 addresses two weaknesses of Sparse-Linear Attention (SLA): a heuristic split between the sparse and linear branches, and a mismatch between SLA and a direct decomposition of attention into sparse and linear parts. It introduces a learnable router that dynamically assigns each attention computation to the sparse or the linear branch, and a direct sparse-linear formulation that combines the two branches with a learnable mixing factor $\alpha$, which aligns the computation with the original sparse-plus-low-rank motivation. It also applies quantization-aware training (QAT) so that the attention can run in low-bit precision with reduced quantization error. On video diffusion models, SLA2 reaches 97% attention sparsity and an $18.6\times$ attention speedup while preserving generation quality, which makes high-sparsity attention practical for real-time or resource-constrained diffusion-model applications.

Abstract

Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal. Additionally, (ii) after formally analyzing the attention error in SLA, we identify a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, where low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.
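To make the sparse-plus-linear combination in (II) concrete, the following is a minimal sketch, not the authors' implementation. It assumes a fixed binary routing mask (in SLA2 the router is learned), a scalar mixing ratio `alpha` (learnable in SLA2), a convex combination `alpha * O_sparse + (1 - alpha) * O_linear`, and an `elu(x) + 1` feature map for the linear branch. All function names are hypothetical, and the linear branch is computed pair by pair for readability rather than in its efficient linear-time form.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phi(x):
    # Assumed feature map for the linear branch: elu(x) + 1 (always positive).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def weighted_sum(weights, rows, dim):
    out = [0.0] * dim
    for w, row in zip(weights, rows):
        for c in range(dim):
            out[c] += w * row[c]
    return out

def sparse_linear_attention(Q, K, V, mask, alpha):
    """mask[i][j] truthy: pair (i, j) goes to the sparse (softmax) branch;
    otherwise it goes to the linear branch. alpha is the assumed mixing ratio."""
    d, dv = len(Q[0]), len(V[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    for i, q in enumerate(Q):
        s_idx = [j for j in range(len(K)) if mask[i][j]]
        l_idx = [j for j in range(len(K)) if not mask[i][j]]
        # Sparse branch: softmax attention restricted to the routed pairs.
        o_s = [0.0] * dv
        if s_idx:
            p = softmax([dot(q, K[j]) * scale for j in s_idx])
            o_s = weighted_sum(p, [V[j] for j in s_idx], dv)
        # Linear branch: normalized kernel attention over the remaining pairs.
        o_l = [0.0] * dv
        if l_idx:
            fq = phi(q)
            w = [dot(fq, phi(K[j])) for j in l_idx]
            total = sum(w)
            o_l = weighted_sum([x / total for x in w], [V[j] for j in l_idx], dv)
        # Assumed mixing rule: convex combination of the two branch outputs.
        outputs.append([alpha * a + (1.0 - alpha) * b for a, b in zip(o_s, o_l)])
    return outputs

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mask = [[1, 0, 0], [0, 0, 1]]  # one sparse pair per query row
mixed = sparse_linear_attention(Q, K, V, mask, alpha=0.7)
```

With an all-ones mask and `alpha = 1.0` the sketch reduces to ordinary softmax attention, which is a convenient sanity check for any variant of this idea.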

Paper Structure (26 sections, 23 equations, 5 figures, 2 tables, 3 algorithms)

Introduction
Preliminaries
Sparse-Linear Attention
Notation.
Rethinking Sparse-Linear Attention
Original motivation of Sparse-Linear Attention.
Error of the sparse attention branch.
How SLA compensates for the mismatch.
A more reasonable formulation.
SLA2 Design
Learnable Router
Quantization-aware Training
Forward (low-bit attention).
Backward (FP16-only).
Training with SLA2
...and 11 more sections

Figures (5)

Figure 1: Attention computation pipeline of SLA2.
Figure 2: Visual examples of SLA2 and baselines on the Wan2.1-T2V-1.3B-480P model. The prompt used for generation is given in the appendix.
Figure 3: Visual examples of SLA2 and baselines on the Wan2.1-T2V-14B-720P model. The prompt used for generation is given in the appendix.
Figure 4: Kernel speed of SLA2 and baselines with different sparsities.
Figure 5: End-to-end generation latency of SLA2 and baselines with different sparsities.
