Adaptive Rank Allocation: Speeding Up Modern Transformers with RaNA Adapters

Roberto Garcia, Jerry Liu, Daniel Sorvisto, Sabri Eyuboglu

TL;DR

The paper tackles the rising inference cost of modern Transformers by moving beyond neuron-based adapters and introducing Adaptive Rank Allocation, with RaNA as a concrete adapter. RaNA replaces linear layers with input-dependent low-rank decompositions, enabling computation to be allocated via rank-aware routers without relying on activation sparsity, and extends to both MLP and QKV components of Transformers. Empirically, RaNA achieves lower reconstruction error and better perplexity/accuracy trade-offs than prior adapters across Llama2-7b, Gemma-2b, and Pythia models at around 40–46% FLOP reductions, including practical latency gains. The approach demonstrates robust applicability to non-sparse activations like SwiGLU and GeLU, highlights the sparsity of rank contributions, and suggests that input-driven rank adaptation can be a powerful generalization of neuron adapters with meaningful real-world speedups.

Abstract

Large Language Models (LLMs) are computationally intensive, particularly during inference. Neuron-adaptive techniques, which selectively activate neurons in Multi-Layer Perceptron (MLP) layers, offer some speedups but suffer from limitations in modern Transformers. These include reliance on sparse activations, incompatibility with attention layers, and the use of costly neuron masking techniques. To address these issues, we propose the Adaptive Rank Allocation framework and introduce the Rank and Neuron Allocator (RaNA) adapter. RaNA adapters leverage rank adapters, which operate on linear layers by applying both low-rank matrix decompositions and adaptive masking to efficiently allocate compute without depending on activation sparsity. This enables RaNA to be generally applied to MLPs and linear components of attention modules, while eliminating the need for expensive maskers found in neuron-adaptive methods. Notably, when compared to neuron adapters, RaNA improves perplexity by up to 7 points and increases accuracy by up to 8 percentage points when reducing FLOPs by $\sim$44% in state-of-the-art Transformer architectures. These results position RaNA as a robust solution for improving inference efficiency in modern Transformer architectures.

Paper Structure

This paper contains 16 sections, 2 theorems, 16 equations, 5 figures, 4 tables, 3 algorithms.

Key Result

Proposition 1

Consider an $\text{MLP}(x)_{ReLU}$ layer (Eqn. eqn:mlp-relu) and its neuron adapted version $\text{MLP}'(x)_{ReLU}$ (Eqn. eqn:mlp-relu-adapted). Then, there exists a rank adapted $\text{MLP}^*_{ReLU}$ (i.e. an MLP whose linear layers have been rank adapted) s.t. $\text{MLP}^*_{ReLU}(x) = \text{MLP}'

Figures (5)

  • Figure 1: RaNA improves the accuracy-compute tradeoff over neuron adapters. The $y$-axis shows accuracy averaged over multiple downstream tasks (Sect. \ref{sect:exp-setup}); for Figs. \ref{fig:llama-curve} and \ref{fig:act_acc}, the $x$-axis shows average FLOPs for a forward pass with sequence length 512; for Fig. \ref{fig:llama-lat-curve}, the $x$-axis shows average per-token decoding latency over a sequence of 492 tokens with initial context lengths ranging from 1 to 1000. We compare RaNA-adapted models to neuron-adapted versions at various compression rates for (left) Llama2-7b and (right) Pythia models. Notably, RaNA accuracies decay more slowly than those of neuron adapters as compression rates increase.
  • Figure 2: The contribution of ranks in Linear Layer Rank Adapters is sparse for multiple layer types (Sect. \ref{sect:contrib-sparse}). Histograms outline the contribution of different column vectors from the $A$ matrix in the Linear Layer Rank Adapter decomposition $Wx \approx A (m(x) \odot B x)$ to the original layers for Llama2-7b (left) and Gemma-2b (right). The red dashed line indicates a 50% sparsity threshold.
  • Figure 3: RaNA adapters attain the lowest errors on Transformer layers when reconstructing original layer outputs. The $y$-axis shows the error percentage; the $x$-axis shows the layer number. Errors induced by different adapters are compared when compressing layers of Llama2-7b, Gemma-2b and Pythia-160M to 50% FLOPs. RaNA attains the lowest error consistently across model layers (Sect. \ref{sect:rana-evals}).
  • Figure 4: Perplexities are measured across adapted Pythia models, as a complementary measurement to the accuracies outlined in Fig. \ref{fig:act_acc}. The $y$-axis shows perplexity measured over $\sim$300K tokens of the Pile dataset (Sect. \ref{sect:exp-setup}); the $x$-axis shows average FLOPs for a forward pass with sequence length 512.
  • Figure 5: Llama2-7b Accuracy vs. FLOPs.
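
The decomposition $Wx \approx A (m(x) \odot B x)$ from the Figure 2 caption can be illustrated with a small NumPy sketch. Everything beyond that formula is an assumption made for illustration: `rank_adapter` is a hypothetical helper, $A$ and $B$ are taken from a truncated-free SVD of $W$, and the mask $m(x)$ is a simple top-`keep` selection on per-rank contribution, which stands in for (and is not claimed to be) the router used in the paper.

```python
import numpy as np

def rank_adapter(W, x, keep):
    """Sketch of W x ~= A (m(x) * B x) with an input-dependent binary mask.

    Assumptions (not from the source): A and B come from an SVD of W, and
    m(x) keeps the `keep` ranks with the largest contribution for this x.
    Requires 1 <= keep <= rank.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U * s            # (d_out, r): left singular vectors scaled by singular values
    B = Vt               # (r, d_in)
    z = B @ x            # coordinates of x along each rank direction
    # U has orthonormal columns, so rank i contributes a vector of norm s_i * |z_i|.
    contrib = s * np.abs(z)
    mask = np.zeros_like(z)
    mask[np.argsort(contrib)[-keep:]] = 1.0
    # Dense product for clarity; a real kernel would skip the masked columns of A.
    return A @ (mask * z), mask

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 12))
x = rng.standard_normal(12)

full, _ = rank_adapter(W, x, keep=12)    # all ranks kept
half, mask = rank_adapter(W, x, keep=6)  # half of the ranks kept
few, _ = rank_adapter(W, x, keep=3)

err_half = np.linalg.norm(half - W @ x)
err_few = np.linalg.norm(few - W @ x)
```

With every rank kept, the sketch reproduces $Wx$ up to floating-point error; as `keep` shrinks, the reconstruction error can only grow, because each dropped rank adds its own orthogonal contribution to the residual. The selected ranks change with the input $x$, which is the sense in which compute is allocated adaptively.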

Theorems & Definitions (2)

  • Proposition 1
  • Theorem 1
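
Proposition 1 states that a neuron-adapted ReLU MLP can be matched exactly by a rank-adapted one. The sketch below checks one possible construction numerically. It rests on assumptions that the text above does not spell out: the MLP is taken to be $W_{down}\,\text{ReLU}(W_{up} x)$, the neuron adapter is taken to multiply the hidden activations by a binary mask, and the choice of $A$ and $B$ (one factor set to the identity in each layer) is an illustrative construction, not necessarily the one used in the paper's proof. All function names are hypothetical.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def neuron_adapted_mlp(W_up, W_down, x, mask):
    # Assumed form of a neuron adapter: masked hidden neurons are zeroed out.
    return W_down @ (mask * relu(W_up @ x))

def rank_adapted_linear(A, B, v, mask):
    # Linear Layer Rank Adapter form: A (m * B v).
    return A @ (mask * (B @ v))

def rank_adapted_mlp(W_up, W_down, x, mask):
    d_hidden = W_up.shape[0]
    I = np.eye(d_hidden)
    # Up projection as a rank adapter with A = I, B = W_up (masks rows of W_up).
    pre = rank_adapted_linear(I, W_up, x, mask)
    h = relu(pre)  # for a binary mask, relu(m * z) == m * relu(z)
    # Down projection as a rank adapter with A = W_down, B = I (masks columns of W_down).
    return rank_adapted_linear(W_down, I, h, mask)

rng = np.random.default_rng(1)
d_model, d_hidden = 8, 32
W_up = rng.standard_normal((d_hidden, d_model))
W_down = rng.standard_normal((d_model, d_hidden))
x = rng.standard_normal(d_model)
mask = (rng.random(d_hidden) < 0.5).astype(float)  # arbitrary binary neuron mask

out_neuron = neuron_adapted_mlp(W_up, W_down, x, mask)
out_rank = rank_adapted_mlp(W_up, W_down, x, mask)
```

Under these assumptions the two outputs agree for any binary mask, which is consistent with reading neuron adapters as a special case of rank adapters in which the ranks are aligned with individual neurons.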