LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation

Xuan Zhang, Fengzhuo Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, Min Lin

TL;DR

LightTransfer converts pretrained transformers into hybrid models by identifying lazy attention layers and replacing their full attention with streaming attention, which shrinks the KV cache with little loss in long-context understanding. It comes in two variants: LightTransfer-Test, which requires no training, and LightTransfer-Train, which fine-tunes on ~5K samples for long-reasoning tasks; a theoretical error bound supports the approach. Empirically, it achieves up to 2.17x higher throughput with less than 1.5% performance loss on LongBench, and reaches 53.3% on AIME24 with the o1-like long-reasoning model QwQ-STILL. These results suggest that layer-wise hybridization is a practical route to efficient long-horizon generation on large pretrained backbones.
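To make the memory saving concrete, here is a minimal sketch of what streaming attention does to one layer's KV cache: it keeps only the first few tokens and a window of the most recent ones. The function names (`trim_kv`, `kv_cache_tokens`) and the sink/window sizes are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def trim_kv(keys, values, n_sink=4, n_recent=1020):
    """Keep the first n_sink and last n_recent cache entries.

    keys, values: arrays of shape (seq_len, head_dim).
    n_sink and n_recent are hypothetical defaults chosen for illustration.
    """
    seq_len = keys.shape[0]
    if seq_len <= n_sink + n_recent:
        return keys, values  # nothing to drop yet
    idx = np.concatenate([
        np.arange(n_sink),                       # initial tokens
        np.arange(seq_len - n_recent, seq_len),  # most recent tokens
    ])
    return keys[idx], values[idx]

def kv_cache_tokens(seq_len, n_layers, n_lazy, n_sink, n_recent):
    """Total cached token slots when n_lazy of n_layers use streaming attention."""
    window = min(seq_len, n_sink + n_recent)
    return (n_layers - n_lazy) * seq_len + n_lazy * window
```

Under these assumed sizes, a 32-layer model with 16 lazy layers, a 1000-token context, and a 100-token streaming window stores 17,600 token slots instead of 32,000.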

Abstract

Scaling language models to handle longer contexts introduces substantial memory challenges due to the growing cost of key-value (KV) caches. Motivated by the efficiency gains of hybrid models and the broad availability of pretrained large transformer backbones, we explore transitioning transformer models into hybrid architectures for more efficient generation. In this work, we propose LightTransfer, a lightweight method that transforms models such as LLaMA into hybrid variants. Our approach identifies lazy layers -- those focusing on recent or initial tokens -- and replaces their full attention with streaming attention. This transformation can be performed without any training for long-context understanding tasks, or with minimal fine-tuning for o1-like long-reasoning generation tasks that require stronger reasoning capabilities. Experiments across diverse benchmarks and models (e.g., LLaMA, Mistral, QwQ-STILL) demonstrate that, even when half of the layers are identified as lazy, LightTransfer achieves up to 2.17$\times$ throughput improvement with minimal performance loss ($<1.5\%$ on LongBench) and reaches 53.3\% on the math benchmark AIME24 with the advanced o1-like long-reasoning model QwQ-STILL.
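The notion of a lazy layer can be sketched as a score: the share of a layer's attention mass that lands on the initial and most recent tokens. The definition below is an assumption made for illustration (the paper's exact lazy-ratio formula is not reproduced here), and `lazy_ratio`, `pick_lazy_layers`, `n_sink`, and `n_recent` are hypothetical names.

```python
import numpy as np

def lazy_ratio(attn, n_sink=4, n_recent=4):
    """Mean attention mass on the first n_sink and n_recent most recent keys.

    attn: (heads, q_len, k_len) causal attention weights whose rows sum to 1.
    The q_len queries are assumed to be the last q_len positions of the sequence.
    """
    heads, q_len, k_len = attn.shape
    offset = k_len - q_len
    mass = np.zeros((heads, q_len))
    for q in range(q_len):
        pos = offset + q                      # absolute position of this query
        keep = np.zeros(k_len, dtype=bool)
        keep[:n_sink] = True                  # initial tokens
        keep[max(0, pos - n_recent + 1):pos + 1] = True  # recent tokens
        keep[pos + 1:] = False                # never count future keys
        mass[:, q] = attn[:, q, keep].sum(axis=-1)
    return float(mass.mean())

def pick_lazy_layers(ratios, k):
    """Indices of the k layers with the highest lazy ratio, in ascending order."""
    order = np.argsort(ratios)[::-1][:k]
    return sorted(int(i) for i in order)
```

A layer that puts all of its attention on the first token scores 1.0 under this definition, while a layer that attends only to mid-context tokens scores 0.0; choosing `k` as half the layer count mirrors the setting described in the abstract.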

Paper Structure

This paper contains 29 sections, 6 theorems, 34 equations, 9 figures, 5 tables.

Key Result

Theorem 5.1

If the Frobenius norms of all the parameters in an $L$-layer transformer with $H$ attention heads are upper bounded by $B$ and the activation function is $L_{\mathsf{lip}}$-Lipschitz, then we have that [...]. If we denote the error of hidden states at layer $i$ as $e_{i}$, then it evolves as [...], where $C_{1}$ and $C_{2}$ are quantities that depend on $B$, $H$, and $L_{\mathsf{lip}}$.

Figures (9)

  • Figure 1: (a) A standard transformer architecture. (b) A hybrid model in which certain layers of a standard transformer are replaced with more memory-efficient designs. LightTransfer identifies lazy layers in (a) and transforms them into more efficient variants, yielding (b).
  • Figure 2: Visualization of attention weight distributions on LLaMA3-8B. Left: The attention patterns across different layers. Right: Each cell represents an attention weight from each token (x-axis) to the initial tokens and the most recent tokens during both the prefilling and decoding stages. Layers that predominantly attend to these tokens are outlined in black boxes.
  • Figure 3: The framework of our LightTransfer-Test. A priority queue is maintained during the prefilling stage to store the lazy ratio and corresponding layer index after processing each layer. Once the queue reaches its capacity, the layer with the highest lazy ratio is identified as a lazy layer, and its KV cache is reduced, freeing memory for storing the KV cache of the current layer.
  • Figure 4: Performance comparison of LightTransfer and standard model on NIAH tasks using Mistral-7B-Instruct.
  • Figure 5: Lazy ratio scores across layers in QwQ-32B-STILL.
  • ...and 4 more figures
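The priority-queue procedure described in the Figure 3 caption can be sketched as a streaming selection over layers. This is a minimal sketch under stated assumptions: the name `stream_lazy_layers` is hypothetical, the lazy ratios are taken as given inputs, and eviction is assumed to happen when the queue would exceed its capacity.

```python
import heapq

def stream_lazy_layers(ratios, capacity):
    """Process layers in order and return the layers marked lazy.

    ratios: lazy ratio of each layer, in layer order.
    capacity: how many layers may keep a full KV cache (assumed budget).
    """
    heap = []   # max-heap on lazy ratio, implemented by negating the ratio
    lazy = []
    for idx, ratio in enumerate(ratios):
        heapq.heappush(heap, (-ratio, idx))
        if len(heap) > capacity:
            # Evict the layer with the highest lazy ratio; in a real system
            # its KV cache would be reduced here to free memory.
            _, evicted = heapq.heappop(heap)
            lazy.append(evicted)
    return lazy

# Six layers with a budget of three full-attention layers.
print(stream_lazy_layers([0.2, 0.9, 0.5, 0.8, 0.1, 0.7], capacity=3))  # [1, 3, 5]
```

Because the highest-ratio layer is evicted each time the budget is exceeded, the layers left in the queue at the end are the ones with the lowest lazy ratios, so at most `capacity` full KV caches are held after any layer finishes.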

Theorems & Definitions (8)

  • Theorem 5.1: Informal
  • Theorem 6.3
  • Proof: Proof of Theorem \ref{thm:err_analysis}
  • Lemma 7.1: Corollary A.7 in edelman2022inductive
  • Lemma 7.2: Lemma 17 in zhang2022relational
  • Lemma 7.3: Lemma I.8 in zhang2023and
  • Lemma 7.4
  • Proof: Proof of Lemma \ref{lem:sink}