LaRoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation

Kai Liu, Bowen Xu, Shaoyu Wu, Xin Chen, Hao Zhou, Yongliang Tao, Lulu Hu

TL;DR

LaRoSA addresses the inefficiency of activation sparsity in LLMs by introducing training-free, layerwise orthogonal rotations that rotate activations prior to sparsification and can be absorbed into subsequent weights. By applying Top-K sparsification on rotated activations and using PCA-derived rotations, LaRoSA achieves consistent model-level sparsity and reliable wall-clock speed-ups across multiple models and sparsity levels, outperforming prior methods like TEAL and CATS. The approach is complemented by a hardware-aware Triton-based kernel and shows minimal perplexity loss (e.g., 0.17 PPL gap on 40% sparsity for LLaMA2-7B) and strong zero-shot/few-shot performance gains, illustrating practical impact for efficient LLM deployment. Overall, LaRoSA provides a scalable, training-free framework that preserves model accuracy while delivering stable acceleration, enabling more accessible inference on resource-constrained environments.

Abstract

Activation sparsity can reduce the computational overhead and memory transfers during the forward pass of Large Language Model (LLM) inference. Existing methods face limitations, either demanding time-consuming recovery training that hinders real-world adoption, or relying on empirical magnitude-based pruning, which causes fluctuating sparsity and unstable inference speed-up. This paper introduces LaRoSA (Layerwise Rotated Sparse Activation), a novel method for activation sparsification designed to improve LLM efficiency without requiring additional training or magnitude-based pruning. We leverage layerwise orthogonal rotations to transform input activations into rotated forms that are more suitable for sparsification. By employing a Top-K selection approach within the rotated activations, we achieve consistent model-level sparsity and reliable wall-clock time speed-up. LaRoSA is effective across various sizes and types of LLMs, demonstrating minimal performance degradation and robust inference acceleration. Specifically, for LLaMA2-7B at 40% sparsity, LaRoSA achieves a mere 0.17 perplexity gap with a consistent 1.30x wall-clock time speed-up, and reduces the accuracy gap in zero-shot tasks compared to the dense model to just 0.54%, while surpassing TEAL by 1.77% and CATS by 17.14%.
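The rotate-then-sparsify idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the helper names `pca_rotation` and `topk_sparsify`, the synthetic correlated activations, and the layer sizes are all assumptions made for illustration. It shows that Top-K selection in the rotated basis keeps the same number of nonzero entries for every token, and that the rotation can be folded into the weight as `Q.T @ W`.

```python
import numpy as np

def pca_rotation(calib):
    """Return an orthogonal matrix whose columns are the eigenvectors of the
    calibration covariance, ordered by descending eigenvalue (hypothetical helper)."""
    cov = calib.T @ calib / calib.shape[0]
    _, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    return eigvecs[:, ::-1]

def topk_sparsify(x, sparsity):
    """Keep the k largest-magnitude entries of each row and zero the rest,
    so every row has the same number of nonzeros (hypothetical helper)."""
    d = x.shape[-1]
    k = d - int(round(sparsity * d))
    idx = np.argpartition(np.abs(x), d - k, axis=-1)[..., d - k:]
    out = np.zeros_like(x)
    np.put_along_axis(out, idx, np.take_along_axis(x, idx, axis=-1), axis=-1)
    return out

rng = np.random.default_rng(0)
d_in, d_out = 64, 32

# Synthetic, correlated "activations" (an assumption; real ones come from a model).
mix = rng.normal(size=(d_in, d_in))
calib = rng.normal(size=(512, d_in)) @ mix
x = rng.normal(size=(8, d_in)) @ mix
W = rng.normal(size=(d_in, d_out))

Q = pca_rotation(calib)
W_rot = Q.T @ W                        # rotation absorbed into the weight offline

x_rot_sparse = topk_sparsify(x @ Q, sparsity=0.4)
y_sparse = x_rot_sparse @ W_rot        # approximates the dense output x @ W
y_dense = x @ W

nonzeros_per_row = np.count_nonzero(x_rot_sparse, axis=-1)
rel_err = np.linalg.norm(y_sparse - y_dense) / np.linalg.norm(y_dense)
print("nonzeros per token:", nonzeros_per_row)
print("relative output error:", rel_err)
```

Because `Q` is orthogonal, setting the sparsity to zero reproduces the dense output exactly (up to floating-point error); any approximation error comes only from the entries dropped by Top-K.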

Paper Structure

This paper contains 18 sections, 1 theorem, 20 equations, 5 figures, 19 tables, 1 algorithm.

Key Result

Theorem 1.1

Let $\tilde{{\bm{x}}}\in\mathbb{R}^{D_\text{in}}$ and $\tilde{{\mathbf{W}}}\in\mathbb{R}^{D_\text{in}\times D_\text{out}}$ be independent and identically Gaussian distributed, where $\tilde{{\bm{x}}}_i \sim N(0, \sigma_x)$ and $\tilde{{\mathbf{W}}}_{ji} \sim N(0, \sigma_w)$. Furthermore, for top-$k$ s where $\varphi(t) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}t^2}$ is the standard Gaussian probability

Figures (5)

  • Figure 1: LaRoSA rotates and prunes the input activations to achieve more consistent and efficient LLM inference. The rotation is orthogonal and can be reversed after pruning. The overall method provides a more accurate approximation of the full computation. Visualizations are from the 12th Attention block of LLaMA2-7B.
  • Figure 2: Model inference with LaRoSA. (a)&(b) The layerwise orthogonal matrix ${\mathbf{Q}}_l$ can be absorbed into weight matrices to avoid extra computations, ensuring that the input activations for each block are pre-transformed by the layer-specific orthogonal matrix ${\mathbf{Q}}_l$. (c) Introducing residual adapters in the residual stream ensures that each layer's input activations have independent orthogonal rotations. The ${\mathbf{Q}}_0$ of the first layer and the ${\mathbf{Q}}_n$ of the last layer can be merged into the weight matrices of the token embedding and LM head layers, respectively.
  • Figure 3: (a) Offline calibrated thresholds for Attention and MLP blocks of LLaMA3-8B. (b) The calibrated and actually needed thresholds in the 24th Attention block of LLaMA3-8B. Input tokens are randomly selected from WikiText2. The dashed lines denote the calibrated thresholds, while scatter points indicate the real thresholds needed to achieve the target sparsity. Many blue points (actual thresholds for 50% sparsity) fall below even the black line (calibrated threshold for 40% sparsity). (c) Average actual activation sparsity of LLaMA3-8B's Attention blocks. Input sequences are from different datasets and sorted by sparsity. (d) The inference speedup of magnitude pruning at 50% sparsity under various output sequence lengths. A speedup of 1.0 corresponds to the dense LLaMA3-8B model. Best viewed in color and zoomed in.
  • Figure 4: Comparison on inference speed-ups at 50% sparsity. Experiments are conducted on NVIDIA A100 GPUs.
  • Figure 5: Relative output error results from the 14th layer of LLaMA3-8B. Theoretical errors are calculated with Top-K based sparsification. Empirical errors are derived from the output of the ${\mathbf{W}}_\text{down}$ projection.
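
The weight absorption described in the Figure 2 caption can be checked numerically on a toy model. The sketch below is one reading of that caption, not the paper's code: the toy residual network (no normalization layers, a `tanh` block), the random orthogonal matrices standing in for the calibrated rotations, and all names and sizes are assumptions. Each block's input weight absorbs the transpose of its own rotation, the output weight absorbs the next layer's rotation, and a residual adapter (the transpose of the current rotation times the next rotation) re-rotates the skip path.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Random orthogonal matrix via QR; a stand-in for a calibrated rotation."""
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

rng = np.random.default_rng(0)
vocab, d, hidden, n_layers = 50, 16, 32, 3

# Toy model (assumption): embedding, residual blocks without norms, LM head.
embed = rng.normal(size=(vocab, d))
lm_head = rng.normal(size=(d, vocab))
blocks = [(rng.normal(size=(d, hidden)) / np.sqrt(d),
           rng.normal(size=(hidden, d)) / np.sqrt(hidden))
          for _ in range(n_layers)]
Qs = [random_orthogonal(d, rng) for _ in range(n_layers)]

def dense_forward(tokens):
    h = embed[tokens]
    for w_in, w_out in blocks:
        h = h + np.tanh(h @ w_in) @ w_out
    return h @ lm_head

# Offline absorption of the rotations into the weights.
embed_rot = embed @ Qs[0]                  # first rotation merged into the embedding
rot_blocks = []
for l, (w_in, w_out) in enumerate(blocks):
    q_next = Qs[l + 1] if l + 1 < n_layers else Qs[l]
    adapter = Qs[l].T @ q_next             # residual adapter (identity for last block)
    rot_blocks.append((Qs[l].T @ w_in, w_out @ q_next, adapter))
lm_head_rot = Qs[-1].T @ lm_head           # last rotation merged into the LM head

def rotated_forward(tokens):
    h = embed_rot[tokens]                  # hidden state lives in the rotated basis
    for w_in, w_out, adapter in rot_blocks:
        # In a sparse setup, h would be sparsified here before multiplying w_in.
        h = h @ adapter + np.tanh(h @ w_in) @ w_out
    return h @ lm_head_rot

tokens = np.array([1, 5, 7])
print(np.allclose(dense_forward(tokens), rotated_forward(tokens)))
```

Without sparsification the two forward passes agree up to floating-point error, which is what makes the rotations free at inference time apart from the residual adapters.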

Theorems & Definitions (2)

  • Theorem 1.1
  • proof