LaRoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation

Kai Liu; Bowen Xu; Shaoyu Wu; Xin Chen; Hao Zhou; Yongliang Tao; Lulu Hu

TL;DR

LaRoSA targets two weaknesses of existing activation-sparsity methods for LLMs: the need for recovery training, and the fluctuating sparsity produced by magnitude-based pruning. It introduces training-free, layerwise orthogonal rotations that are applied to activations before sparsification and can be absorbed into the subsequent weights. With PCA-derived rotations and Top-K sparsification of the rotated activations, LaRoSA achieves consistent model-level sparsity and reliable wall-clock speed-ups across multiple models and sparsity levels, outperforming prior methods such as TEAL and CATS. The approach is complemented by a hardware-aware Triton-based kernel. It shows minimal perplexity loss (e.g., a 0.17 PPL gap at 40% sparsity for LLaMA2-7B) and strong zero-shot and few-shot accuracy. Overall, LaRoSA is a scalable, training-free framework that preserves model accuracy while delivering stable acceleration, which makes inference more practical in resource-constrained environments.
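The rotate-then-sparsify idea summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names `pca_rotation` and `topk_sparsify`, the synthetic calibration data, the single linear layer, and the choice to fold the inverse rotation into that layer's weight are all assumptions made for illustration.

```python
import numpy as np

def pca_rotation(calib_acts):
    """Return an orthogonal matrix Q from calibration activations.

    Assumption: Q holds the eigenvectors of the activation second-moment
    matrix, which is one common way to obtain a PCA basis.
    """
    second_moment = calib_acts.T @ calib_acts / calib_acts.shape[0]
    _, eigvecs = np.linalg.eigh(second_moment)  # columns are orthonormal
    return eigvecs

def topk_sparsify(x, k):
    """Keep the k largest-magnitude entries of each row and zero the rest."""
    idx = np.argpartition(np.abs(x), -k, axis=-1)[..., -k:]
    mask = np.zeros(x.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return np.where(mask, x, 0.0)

rng = np.random.default_rng(0)
d_in, d_out = 64, 32

# Synthetic, correlated "activations" standing in for real calibration data.
calib = rng.standard_normal((512, d_in)) @ rng.standard_normal((d_in, d_in))
W = rng.standard_normal((d_in, d_out))  # hypothetical linear layer weight

Q = pca_rotation(calib)
W_rot = Q.T @ W  # inverse rotation folded into the weight, done once offline

x = calib[:4]    # a few "tokens"
x_rot = x @ Q    # rotate activations

dense_out = x @ W
rotated_dense_out = x_rot @ W_rot  # same result, since Q @ Q.T = I

k = round(0.6 * d_in)  # keep 60% of channels, i.e. 40% sparsity
sparse_out = topk_sparsify(x_rot, k) @ W_rot
rel_err = np.linalg.norm(sparse_out - dense_out) / np.linalg.norm(dense_out)
print(f"kept {k}/{d_in} channels per token, relative output error {rel_err:.4f}")
```

Because `Q` is orthogonal, the rotation on its own does not change the layer output; only the Top-K step introduces error. In this sketch the rotation `x @ Q` is applied explicitly at run time, whereas the summary above describes the rotations as absorbable into weights.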

Abstract

Activation sparsity can reduce the computational overhead and memory transfers during the forward pass of Large Language Model (LLM) inference. Existing methods face limitations, either demanding time-consuming recovery training that hinders real-world adoption, or relying on empirical magnitude-based pruning, which causes fluctuating sparsity and unstable inference speed-up. This paper introduces LaRoSA (Layerwise Rotated Sparse Activation), a novel method for activation sparsification designed to improve LLM efficiency without requiring additional training or magnitude-based pruning. We leverage layerwise orthogonal rotations to transform input activations into rotated forms that are more suitable for sparsification. By employing a Top-K selection approach within the rotated activations, we achieve consistent model-level sparsity and reliable wall-clock time speed-up. LaRoSA is effective across various sizes and types of LLMs, demonstrating minimal performance degradation and robust inference acceleration. Specifically, for LLaMA2-7B at 40% sparsity, LaRoSA achieves a mere 0.17 perplexity gap with a consistent 1.30x wall-clock time speed-up, and reduces the accuracy gap in zero-shot tasks compared to the dense model to just 0.54%, while surpassing TEAL by 1.77% and CATS by 17.14%.
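The contrast between magnitude-based pruning and Top-K selection can be made concrete with a small sketch. The fixed-threshold rule below is a generic stand-in for magnitude-based pruning and is not the actual procedure of TEAL or CATS; the synthetic tokens, their scales, and the function names are assumptions chosen only to show why a fixed threshold gives token-dependent sparsity while Top-K does not.

```python
import numpy as np

def threshold_sparsity(x, tau):
    """Fraction of entries per row that a fixed magnitude threshold would zero."""
    return np.mean(np.abs(x) < tau, axis=-1)

def topk_sparsity(x, k):
    """Fraction of entries per row zeroed when exactly k entries are kept."""
    idx = np.argpartition(np.abs(x), -k, axis=-1)[..., -k:]
    mask = np.zeros(x.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return 1.0 - mask.mean(axis=-1)

rng = np.random.default_rng(0)
d = 1000

# Eight synthetic "tokens" whose activation scale differs from token to token.
scales = np.linspace(0.5, 2.0, 8)[:, None]
x = scales * rng.standard_normal((8, d))

# Threshold calibrated so that 40% of all entries fall below it on average.
tau = np.quantile(np.abs(x), 0.4)

thr = threshold_sparsity(x, tau)
top = topk_sparsity(x, k=round(0.6 * d))

print("per-token sparsity, fixed threshold:", np.round(thr, 2))
print("per-token sparsity, Top-K:          ", np.round(top, 2))
```

With a fixed threshold, tokens with small activations are pruned far more aggressively than tokens with large ones, so the per-token compute varies. Top-K keeps the same number of channels for every token, which is the property the abstract associates with consistent sparsity and stable speed-up.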
