When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models

Haoran You, Yichao Fu, Zheng Wang, Amir Yazdanbakhsh, Yingyan Celine Lin

TL;DR

The paper analyzes the efficacy of linear attention (LA) for autoregressive LLMs and its compatibility with speculative decoding. It finds that naïvely applying encoder-focused LAs to autoregressive decoders underperforms because of temporal dependencies and information leakage. It proposes a causal, masked DWConv augmentation with grouped attention to improve locality while preserving causality, and introduces an unfolded DWConv approach to align with tree-based speculative decoding. Across multiple LLMs and long-context tasks, the augmented LAs deliver up to a 6.67 reduction in perplexity and up to a 2× generation speedup, while enabling longer sequence lengths (e.g., 32K). These results point to a practical pathway for more efficient training and deployment of autoregressive LLMs in long-context scenarios.

Abstract

Autoregressive Large Language Models (LLMs) have achieved impressive performance in language tasks but face two significant bottlenecks: (1) quadratic complexity in the attention module as the number of tokens increases, and (2) limited efficiency due to the sequential processing nature of autoregressive LLMs during generation. While linear attention and speculative decoding offer potential solutions, their applicability and synergistic potential for enhancing autoregressive LLMs remain uncertain. We conduct the first comprehensive study on the efficacy of existing linear attention methods for autoregressive LLMs, integrating them with speculative decoding. We introduce an augmentation technique for linear attention that ensures compatibility with speculative decoding, enabling more efficient training and serving of LLMs. Extensive experiments and ablation studies involving seven existing linear attention models and five encoder/decoder-based LLMs consistently validate the effectiveness of our augmented linearized LLMs. Notably, our approach achieves up to a 6.67 reduction in perplexity on the LLaMA model and up to a 2$\times$ speedup during generation compared to prior linear attention methods. Code and models are available at https://github.com/GATECH-EIC/Linearized-LLM.
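
To make the first bottleneck concrete, here is a minimal sketch of how a generic kernelized linear attention can be computed causally with a running state, so the per-token cost does not grow with the number of previous tokens. It is not the authors' implementation: the `feature_map` (a shifted ReLU) and all function names are assumptions chosen for illustration. A quadratic reference is included only to check that the two forms agree.

```python
from typing import List

Matrix = List[List[float]]


def feature_map(x: List[float]) -> List[float]:
    # Hypothetical positive feature map (shifted ReLU); an assumption for this sketch.
    return [max(v, 0.0) + 1e-6 for v in x]


def causal_linear_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Causal linear attention using a running state (constant work per token)."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k_j) v_j^T
    norm = [0.0] * d_k                         # running sum of phi(k_j)
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        fq, fk = feature_map(q_t), feature_map(k_t)
        # Fold the current token into the state before reading it,
        # so token t attends to positions 0..t and nothing later.
        for a in range(d_k):
            norm[a] += fk[a]
            for b in range(d_v):
                state[a][b] += fk[a] * v_t[b]
        denom = sum(fq[a] * norm[a] for a in range(d_k))
        out.append([sum(fq[a] * state[a][b] for a in range(d_k)) / denom
                    for b in range(d_v)])
    return out


def causal_quadratic_reference(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Same attention computed pairwise over all (t, j <= t); quadratic in length."""
    out = []
    for t in range(len(q)):
        fq = feature_map(q[t])
        weights = [sum(x * y for x, y in zip(fq, feature_map(k[j])))
                   for j in range(t + 1)]
        total = sum(weights)
        out.append([sum(w * v[j][b] for j, w in enumerate(weights)) / total
                    for b in range(len(v[0]))])
    return out


if __name__ == "__main__":
    q = [[1.0, 0.5], [0.2, -0.3], [0.7, 0.1]]
    k = [[0.3, 0.9], [-0.4, 0.6], [0.5, 0.5]]
    v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(causal_linear_attention(q, k, v))
    print(causal_quadratic_reference(q, k, v))
```

The running-state form is what makes token-by-token decoding cheap: each new token updates `state` and `norm` once instead of revisiting every earlier key and value.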

Paper Structure (20 sections, 1 equation, 10 figures, 16 tables)


Figures (10)

  • Figure 1: Empirical evaluation of seven linear attention methods on top of three types of LLMs on the GLUE [wang2018glue] benchmark: (1) encoder-based BERT [devlin2018bert]; (2) decoder-based GPT-2 [gpt2]; and (3) encoder-decoder T5 [roberts2022t5x]. Left: The majority of SOTA linear attentions, including LinFormer [wang2020linformer], TransNormer [transnormer], FLASH-Local [flash], and YOSO [yoso], exhibit superior performance on encoder-based models compared to decoder-based ones. Right: Other linear attention methods, such as the ReLU-based one [cai2023efficientvit], Performer [performers], and FLASH-Global [flash], consistently perform less effectively on all LLMs.
  • Figure 2: Runtime profiling: (a) actual runtime latencies for both the softmax and the entire model; (b) the percentage of time allocated to softmax computations across the latency of the entire model. All data were collected using BERT-Base/Large models on a single A5000 or A100 GPU.
  • Figure 3: Illustration of autoregressive LLMs. Text generation unfolds in two stages: (1) an initial summarization or prefill phase that employs a large batch size and utilizes the given input context; followed by (2) the generation or decode phase, which operates on a single-batch basis, using previously generated tokens to continue the text output.
  • Figure 4: Existing augmented LAs fail in autoregressive LLMs. Left: The augmented DWConv branch results in zero loss/accuracy, as indicated by the yellow line. Right: Illustration of the information leakage phenomenon, i.e., next tokens are prematurely revealed as shown by red arrows, in autoregressive LLMs with DWConv in the $\mathbf{V}$ branch.
  • Figure 5: Model architecture of our LA augmentation.
  • ...and 5 more figures
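
The information leakage described in the Figure 4 caption can be reproduced with a toy example. The sketch below is an illustration under stated assumptions, not the paper's code: it applies a 1-D depthwise convolution to a single channel with an odd-length, centered kernel, and `mask_future` zeroes the taps that would read later positions. The names `depthwise_conv1d` and `changed_positions` are hypothetical. Editing only the last token and checking which outputs change shows whether earlier positions can see the future.

```python
from typing import List


def depthwise_conv1d(x: List[float], kernel: List[float],
                     mask_future: bool) -> List[float]:
    """1-D convolution over one channel with a centered, odd-length kernel.

    With mask_future=True, taps at positive offsets (future tokens) are
    skipped, so output t depends only on inputs at positions <= t.
    """
    half = len(kernel) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(kernel):
            offset = i - half  # negative = past, 0 = current, positive = future
            if mask_future and offset > 0:
                continue
            j = t + offset
            if 0 <= j < len(x):  # zero padding outside the sequence
                acc += w * x[j]
        out.append(acc)
    return out


def changed_positions(a: List[float], b: List[float]) -> List[int]:
    """Indices at which two equally long output sequences differ."""
    return [i for i, (u, w) in enumerate(zip(a, b)) if abs(u - w) > 1e-12]


if __name__ == "__main__":
    kernel = [0.25, 0.5, 0.25]
    x = [1.0, 2.0, 3.0, 4.0]
    x_edit = [1.0, 2.0, 3.0, 100.0]  # only the last token differs

    for mask in (False, True):
        diff = changed_positions(depthwise_conv1d(x, kernel, mask),
                                 depthwise_conv1d(x_edit, kernel, mask))
        print(f"mask_future={mask}: outputs changed at positions {diff}")
```

Without the mask, position 2 changes when only token 3 is edited, which is the leakage: during training the model can read the token it is supposed to predict. With the mask, only position 3 changes, so causality is preserved.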