LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation
Xuan Zhang, Fengzhuo Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, Min Lin
TL;DR
LightTransfer converts pretrained transformers into hybrid models by identifying lazy attention layers and replacing their full attention with streaming attention, which significantly reduces KV-cache memory with minimal loss in long-context understanding. The framework offers a test-time variant (LightTransfer-Test) that requires no training and a training variant (LightTransfer-Train) that uses ~5K samples for robust long-reasoning performance, backed by a theoretical error bound. Empirically, it achieves up to 2.17x throughput with less than 1.5% performance loss on LongBench and maintains competitive accuracy on advanced long-reasoning tasks such as AIME24. These results highlight the practical potential of layer-wise hybridization for efficient long-horizon generation on large pretrained backbones with minimal retraining.
Abstract
Scaling language models to handle longer contexts introduces substantial memory challenges due to the growing cost of key-value (KV) caches. Motivated by the efficiency gains of hybrid models and the broad availability of pretrained large transformer backbones, we explore transitioning transformer models into hybrid architectures for more efficient generation. In this work, we propose LightTransfer, a lightweight method that transforms models such as LLaMA into hybrid variants. Our approach identifies lazy layers -- those whose attention focuses on recent or initial tokens -- and replaces their full attention with streaming attention. This transformation can be performed without any training for long-context understanding tasks, or with minimal fine-tuning for o1-like long reasoning generation tasks that require stronger reasoning capabilities. Experiments across diverse benchmarks and models (e.g., LLaMA, Mistral, QwQ-STILL) demonstrate that, even when half of the layers are identified as lazy, LightTransfer achieves up to 2.17$\times$ throughput improvement with minimal performance loss ($<1.5\%$ on LongBench) and reaches 53.3\% on the math benchmark AIME24 with the advanced o1-like long reasoning model QwQ-STILL.
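The lazy-layer idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a layer's "laziness" is scored as the share of attention mass that falls on the first few (initial) and last few (recent) key positions, and that the highest-scoring half of the layers is switched to streaming attention. The function names and the window sizes `n_sink` and `n_recent` are hypothetical and chosen only for illustration.

```python
import numpy as np


def _streaming_mask(seq_len, n_sink, n_recent):
    """Boolean mask over key positions: True for initial and recent tokens."""
    keep = np.zeros(seq_len, dtype=bool)
    keep[:n_sink] = True
    keep[max(seq_len - n_recent, 0):] = True
    return keep


def lazy_ratio(attn, n_sink=4, n_recent=8):
    """Mean attention mass on initial + recent keys (assumed laziness score).

    attn: array of shape (heads, seq_len); each row is one query's attention
    distribution over the keys and sums to 1.
    """
    keep = _streaming_mask(attn.shape[-1], n_sink, n_recent)
    return float(attn[:, keep].sum(axis=-1).mean())


def select_lazy_layers(per_layer_attn, fraction=0.5, n_sink=4, n_recent=8):
    """Return sorted indices of the `fraction` of layers with the highest score."""
    scores = [lazy_ratio(a, n_sink, n_recent) for a in per_layer_attn]
    k = int(len(scores) * fraction)
    order = np.argsort(scores)[::-1]  # highest score first
    return sorted(int(i) for i in order[:k])


def streaming_keep_indices(seq_len, n_sink=4, n_recent=8):
    """Key/value positions a streaming layer would retain in its cache."""
    return np.flatnonzero(_streaming_mask(seq_len, n_sink, n_recent))
```

Under these assumptions, a layer selected as lazy stores at most `n_sink + n_recent` key-value pairs regardless of context length, while the remaining layers keep the full cache. This is where the memory saving in a hybrid model of this kind would come from.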
