EE-Tuning: An Economical yet Scalable Solution for Tuning Early-Exit Large Language Models

Xuchen Pan, Yanxi Chen, Yaliang Li, Bolin Ding, Jingren Zhou

TL;DR

EE-Tuning offers a pragmatic, two-stage method to convert pre-trained decoder-only LLMs into early-exit models by attaching and tuning dedicated exit layers while freezing the backbone. The approach is designed for scalability, leveraging full 3D parallelism and a memory-efficient pipeline that minimizes training costs, demonstrated on up to 70B-parameter models with 1.2–1.6× inference speedups and minimal quality loss. Key innovations include copy-based initialization of exits, diverse exit architectures (MLP, Norm, etc.), plug-and-play deployment, and dynamic token-wise loss weighting. The work provides extensive empirical evidence across model scales and tasks, discusses limitations of frozen-backbone tuning, and offers practical guidance and code to broaden accessibility of efficient early-exit LLMs in real-world settings.

Abstract

This work introduces EE-Tuning, a lightweight and economical solution to training/tuning early-exit large language models (LLMs). In contrast to the common approach of full-parameter pre-training, EE-Tuning augments any pre-trained (and possibly fine-tuned) standard LLM with additional early-exit layers that are tuned in a parameter-efficient manner, which requires significantly fewer computational resources and much less training data. Our implementation of EE-Tuning achieves outstanding training efficiency via extensive performance optimizations, as well as scalability due to its full compatibility with 3D parallelism. Results of systematic experiments validate the efficacy of EE-Tuning, confirming that effective early-exit LLM inference can be achieved with a limited training budget. In the hope of making early-exit LLMs accessible to the community, we release the source code of our implementation of EE-Tuning at https://github.com/pan-x-c/EE-LLM.
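
To make the idea of tuning only the added exit layers concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the "backbone" is a fixed random projection standing in for frozen transformer layers, the "exit" is a single linear head, and all names (`backbone_hidden`, `exit_logits`, `tune_exit`) and the synthetic data are hypothetical. The only point it illustrates is that gradient updates touch the exit parameters while the backbone weights stay unchanged.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def backbone_hidden(x, backbone_w):
    # Stand-in for the frozen pre-trained layers (assumption: linear map + tanh).
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in backbone_w]

def exit_logits(h, exit_w):
    # Hypothetical early-exit head: one linear layer on the hidden state.
    return [sum(w * hi for w, hi in zip(row, h)) for row in exit_w]

def tune_exit(data, backbone_w, exit_w, lr=0.1, epochs=50):
    """Run SGD on the exit head only; backbone_w is read but never written."""
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in data:
            h = backbone_hidden(x, backbone_w)
            p = softmax(exit_logits(h, exit_w))
            total += -math.log(p[y])
            # Gradient of cross-entropy w.r.t. the logits is p - onehot(y).
            for k in range(len(exit_w)):
                g = p[k] - (1.0 if k == y else 0.0)
                for j in range(len(h)):
                    exit_w[k][j] -= lr * g * h[j]
        losses.append(total / len(data))
    return losses

random.seed(0)
DIM, HID, CLASSES = 4, 8, 3
backbone_w = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(HID)]
backbone_before = [row[:] for row in backbone_w]   # snapshot to check freezing
exit_w = [[0.0] * HID for _ in range(CLASSES)]

# Synthetic task: the label is the index of the largest of the first 3 inputs.
data = []
for _ in range(30):
    x = [random.uniform(-1, 1) for _ in range(DIM)]
    data.append((x, max(range(CLASSES), key=lambda k: x[k])))

losses = tune_exit(data, backbone_w, exit_w)
print("first/last epoch loss:", losses[0], losses[-1])
print("backbone unchanged:", backbone_w == backbone_before)
```

Because only the small exit head receives updates, no gradients or optimizer state are needed for the backbone, which is the intuition behind the reduced training cost described above.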

Paper Structure (38 sections, 1 equation, 17 figures, 3 tables)

Figures (17)

  • Figure 1: An outline of EE-Tuning, the proposed two-stage procedure that converts a pre-trained standard LLM into a well-trained early-exit LLM.
  • Figure 2: A visualization of various early-exit architectures. Each attention or MLP module follows the residual structure with pre-normalization.
  • Figure 3: One training iteration of our customized pipeline schedule used in EE-Tuning, in a setting with 4 pipeline stages and 8 microbatches indexed by numbers in the blocks.
  • Figure 4: Training losses of all early exits at the end of EE-Tuning for various early-exit architectures.
  • Figure 5: Downstream performance of our 13B models with various early-exit architectures. Points closer to the top-right corner represent better performance (i.e., higher speedup and higher scores). Markers on each curve correspond to the discrete values of the confidence threshold used in this experiment, namely $\{1.0, 0.9, 0.8, 0.6, 0.4, 0.2\}$. Speedup increases from left to right as the threshold decreases.
  • ...and 12 more figures
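
The role of the confidence threshold in the Figure 5 caption can be sketched as follows. This is an illustrative toy, not the paper's inference code: it assumes that confidence is the maximum softmax probability at an exit, that a token is emitted at the first exit whose confidence reaches the threshold, and that the last entry is the model's final output layer. The function name `choose_exit` and the logits are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def choose_exit(per_exit_logits, threshold):
    """Return (exit_index, token_id) for the first sufficiently confident exit.

    Assumption: confidence = max softmax probability. The last entry of
    per_exit_logits is treated as the final output layer and always accepted.
    """
    for i, logits in enumerate(per_exit_logits[:-1]):
        probs = softmax(logits)
        conf = max(probs)
        if conf >= threshold:
            return i, probs.index(conf)
    final = softmax(per_exit_logits[-1])
    return len(per_exit_logits) - 1, final.index(max(final))

# Hypothetical logits for one token at two early exits and the final layer.
token_logits = [
    [1.0, 0.5, 0.2],   # shallow exit: low confidence
    [3.0, 0.5, 0.2],   # middle exit: higher confidence
    [6.0, 0.5, 0.2],   # final output layer
]

for threshold in [1.0, 0.9, 0.8, 0.6, 0.4, 0.2]:
    exit_index, token_id = choose_exit(token_logits, threshold)
    print(f"threshold={threshold}: exit {exit_index}, token {token_id}")
```

Under these assumptions, a lower threshold lets the token leave at a shallower exit, so fewer layers are computed, which matches the caption's statement that speedup grows as the threshold decreases.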

Theorems & Definitions (2)

  • Remark 1
  • Remark 2