EE-Tuning: An Economical yet Scalable Solution for Tuning Early-Exit Large Language Models
Xuchen Pan, Yanxi Chen, Yaliang Li, Bolin Ding, Jingren Zhou
TL;DR
EE-Tuning is a pragmatic two-stage method for converting pre-trained decoder-only LLMs into early-exit models: it attaches dedicated exit layers and tunes them while keeping the backbone frozen. The approach is designed for scalability, combining full 3D parallelism with a memory-efficient pipeline that keeps training costs low, and it is demonstrated on models of up to 70B parameters with 1.2–1.6× inference speedups and minimal quality loss. Key innovations include copy-based initialization of exits, diverse exit architectures (MLP, Norm, etc.), plug-and-play deployment, and dynamic token-wise loss weighting. The work provides extensive empirical evidence across model scales and tasks, discusses the limitations of frozen-backbone tuning, and offers practical guidance and code to make efficient early-exit LLMs more accessible in real-world settings.
Abstract
This work introduces EE-Tuning, a lightweight and economical solution to training/tuning early-exit large language models (LLMs). In contrast to the common approach of full-parameter pre-training, EE-Tuning augments any pre-trained (and possibly fine-tuned) standard LLM with additional early-exit layers that are tuned in a parameter-efficient manner, which requires significantly fewer computational resources and much less training data. Our implementation of EE-Tuning achieves outstanding training efficiency via extensive performance optimizations, as well as scalability due to its full compatibility with 3D parallelism. Results of systematic experiments validate the efficacy of EE-Tuning, confirming that effective early-exit LLM inference can be achieved with a limited training budget. In the hope of making early-exit LLMs accessible to the community, we release the source code of our implementation of EE-Tuning at https://github.com/pan-x-c/EE-LLM.
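To make the parameter-efficient idea concrete, the following is a minimal toy sketch in plain Python, not the released EE-LLM implementation. It assumes that "copy-based initialization" means duplicating the final output head into each early exit, and that tuning updates only the exit parameters while the backbone and final head stay frozen. All names here (`Param`, `ToyLLM`, `attach_early_exits`, `freeze_backbone`, `count_params`) are hypothetical and chosen for illustration only.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Param:
    """Toy parameter tensor: a flat list of floats plus a trainable flag."""
    values: list
    trainable: bool = True


@dataclass
class ToyLLM:
    layers: list                               # backbone blocks, each {name: Param}
    final_head: dict                           # final output head, {name: Param}
    exits: dict = field(default_factory=dict)  # layer index -> early-exit head


def attach_early_exits(model, exit_layers):
    """Stage 1 (assumed): add exit heads initialised as copies of the final head."""
    for idx in exit_layers:
        if not 0 < idx < len(model.layers):
            raise ValueError(f"exit index {idx} must lie strictly inside the backbone")
        # deepcopy gives each exit its own parameters, independent of the final head
        model.exits[idx] = copy.deepcopy(model.final_head)
    return model


def freeze_backbone(model):
    """Stage 2 setup (assumed): only early-exit parameters remain trainable."""
    for block in model.layers:
        for p in block.values():
            p.trainable = False
    for p in model.final_head.values():
        p.trainable = False
    for head in model.exits.values():
        for p in head.values():
            p.trainable = True
    return model


def count_params(model, trainable_only=False):
    """Count scalar parameters, optionally restricted to trainable ones."""
    groups = list(model.layers) + [model.final_head] + list(model.exits.values())
    return sum(
        len(p.values)
        for group in groups
        for p in group.values()
        if p.trainable or not trainable_only
    )


# Toy model: 4 backbone blocks of 10 parameters each, plus an 8-parameter head.
model = ToyLLM(
    layers=[{"w": Param([0.0] * 10)} for _ in range(4)],
    final_head={"norm": Param([1.0] * 2), "proj": Param([0.5] * 6)},
)
attach_early_exits(model, exit_layers=[1, 2])
freeze_backbone(model)

trainable = count_params(model, trainable_only=True)
total = count_params(model)
print(f"trainable parameters: {trainable} of {total}")
```

In this toy setting only the two copied exit heads are trainable, which mirrors why tuning exits on a frozen backbone needs far less compute than full-parameter training. In a real LLM the backbone dominates the parameter count, so the trainable share would be much smaller than in this example.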
