Lizard: An Efficient Linearization Framework for Large Language Models

Chien Van Nguyen, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Viet Dac Lai, Haoliang Wang, Jayakumar Subramanian, Ryan A. Rossi, Trung Bui, Nikos Vlassis, Franck Dernoncourt, Thien Huu Nguyen

TL;DR

Lizard is a framework for converting pretrained Transformer LLMs into subquadratic models. It replaces softmax attention with a hybrid of gated linear attention and anchor window attention, augmented by a learnable gating module for adaptive memory control and length generalization. Training proceeds in two stages: a RoPE-free attention approximation stage that recovers the teacher's attention patterns, followed by language-model fine-tuning with the Lizard modules in place. A hardware-aware reparameterization stabilizes low-precision training and relies on standard GEMMs for efficiency. Empirically, Lizard achieves near-lossless recovery of teacher performance, outperforming previous methods by 9.4 to 24.5 points on 5-shot MMLU and improving associative recall, while offering up to 32% higher training throughput and constant memory during long-context generation. These results suggest that subquadratic attention can deliver practical long-context capability without a large loss in model quality, enabling efficient deployment of LLMs whose memory use does not grow with context length.

Abstract

We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into subquadratic architectures. Transformers face severe computational and memory bottlenecks on long sequences because of the quadratic complexity of softmax attention and a Key-Value (KV) cache that grows with context length, which makes inference memory-bound. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving model quality. Unlike prior linearization methods constrained by fixed, non-adaptive structures, Lizard augments the architecture with compact, learnable modules that enable adaptive memory control and robust length generalization. Moreover, we introduce a hardware-aware algorithm that resolves numerical instability in gated attention and accelerates training. Extensive experiments show that Lizard achieves near-lossless recovery of its teacher model's performance, significantly outperforming previous methods by 9.4 to 24.5 points on the 5-shot MMLU benchmark and demonstrating superior associative recall.
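
To make the memory argument concrete, the following is a minimal NumPy sketch of a generic gated linear attention recurrence. It is not the Lizard kernel. The function names, the scalar per-step gate, and the omission of feature maps, normalization, and the anchor window branch are all simplifying assumptions made for illustration. The sketch shows that a fixed-size state of shape `(d_k, d_v)` can stand in for a growing KV cache, and that the recurrent form matches an equivalent quadratic, decay-masked form.

```python
import numpy as np


def gla_recurrent(q, k, v, gate):
    """Generic gated linear attention, recurrent form (illustrative only).

    q, k: (T, d_k); v: (T, d_v); gate: (T,) with values in (0, 1].
    Memory is a single (d_k, d_v) state, independent of T.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    state = np.zeros((d_k, d_v))
    out = np.zeros((T, d_v))
    for t in range(T):
        # Decay the old memory, then write the new key-value outer product.
        state = gate[t] * state + np.outer(k[t], v[t])
        # Read from memory with the current query.
        out[t] = q[t] @ state
    return out, state


def gla_quadratic(q, k, v, gate):
    """Same computation written as a causal, decay-weighted T x T product."""
    T = q.shape[0]
    log_g = np.cumsum(np.log(gate))
    # diff[t, s] = sum of log gate[j] for j in s+1..t (valid for s <= t).
    diff = log_g[:, None] - log_g[None, :]
    causal = np.tril(np.ones((T, T), dtype=bool))
    # Non-causal entries get -inf so that exp() maps them to exactly 0.
    decay = np.exp(np.where(causal, diff, -np.inf))
    return ((q @ k.T) * decay) @ v


# Hypothetical toy sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
T, d_k, d_v = 16, 4, 8
q = rng.standard_normal((T, d_k))
k = rng.standard_normal((T, d_k))
v = rng.standard_normal((T, d_v))
gate = rng.uniform(0.5, 1.0, size=T)

out_rec, final_state = gla_recurrent(q, k, v, gate)
out_quad = gla_quadratic(q, k, v, gate)
```

The recurrent form is the one that matters at generation time, since each new token updates `final_state` in place instead of appending to a cache. The quadratic form is included only as a reference to check the recurrence against.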

Paper Structure

This paper contains 18 sections, 11 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: An overview of the Lizard Attention architecture. Lizard replaces standard attention with a hybrid mechanism that combines Gated Linear Attention (top) for global context compression and Anchor Window Attention (bottom) for local precision. The components highlighted in red, namely the feature maps ($\phi$), the gating module ($W_\gamma$), and the meta-memory tokens ($t$), are the compact, learnable modules added to the teacher architecture.
  • Figure 2: Needle-in-a-Haystack evaluation. Each cell shows retrieval accuracy by sequence length (X-axis) and target distance (Y-axis). Green indicates success; red indicates failure. The white dashed line marks the max training length.
  • Figure 3: Example from the synthetic passkey retrieval dataset.
  • Figure 4: Throughput and memory comparison.
  • Figure 5: Inference speed comparison between the GLA and Lizard kernels.