EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism

Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou

TL;DR

EE-LLM tackles the high cost and latency of training and deploying large language models by enabling large-scale training and inference of early-exit LLMs with 3D parallelism, built on top of Megatron-LM. It introduces a lightweight backpropagation mechanism for multi-exit objectives across pipeline stages, along with strategies that exploit idle resources and pipeline bubbles to minimize training overhead. For inference, EE-LLM provides two approaches compatible with KV caching (KV recomputation and pipeline-based parallel decoding) that realize token-wise adaptive exits without sacrificing autoregressive generation quality. Empirically, EE-LLM achieves training efficiency close to that of standard LLMs, with negligible overhead, and delivers substantial inference speedups (often 2x) for EE-LLMs of up to 30B parameters on multi-node GPU clusters, while maintaining comparable evaluation performance. The authors release the code and show that early exiting can be a practical option at scale, with potential applicability to broader architectures and hardware.

Abstract

We present EE-LLM, a framework for large-scale training and inference of early-exit large language models (LLMs). While recent works have shown preliminary evidence for the efficacy of early exiting in accelerating LLM inference, EE-LLM makes a foundational step towards scaling up early-exit LLMs by supporting their training and inference with massive 3D parallelism. Built upon Megatron-LM, EE-LLM implements a variety of algorithmic innovations and performance optimizations tailored to early exiting, including a lightweight method that facilitates backpropagation for the early-exit training objective with pipeline parallelism, techniques of leveraging idle resources in the original pipeline schedule for computation related to early-exit layers, and two approaches of early-exit inference that are compatible with KV caching for autoregressive generation. Our analytical and empirical study shows that EE-LLM achieves great training efficiency with negligible computational overhead compared to standard LLM training, as well as outstanding inference speedup without compromising output quality. To facilitate further research and adoption, we release EE-LLM at https://github.com/pan-x-c/EE-LLM.
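To make the idea of token-wise early exiting concrete, here is a minimal Python sketch. It assumes a confidence-threshold exit rule (exit at the first layer whose top softmax probability reaches a threshold), which is a common choice but is not stated in the abstract above. The names `early_exit_predict`, `exit_logits`, and `threshold` are hypothetical and are not part of the EE-LLM codebase.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(
    exit_logits: Sequence[Sequence[float]], threshold: float = 0.9
) -> Tuple[int, int]:
    """Return (token_id, exit_index) for one decoding step.

    exit_logits holds the logits of each exit, ordered from the
    shallowest early exit to the final exit. The first exit whose top
    probability reaches the (assumed) threshold is used; the final
    exit is always accepted as a fallback.
    """
    for exit_index, logits in enumerate(exit_logits):
        probs = softmax(logits)
        confidence = max(probs)
        token_id = probs.index(confidence)
        is_final = exit_index == len(exit_logits) - 1
        if confidence >= threshold or is_final:
            return token_id, exit_index
    raise ValueError("exit_logits must not be empty")
```

In a real model the deeper exits would not be evaluated once a shallower exit is taken; the sketch receives all logits up front only to stay self-contained.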

Paper Structure

This paper contains 66 sections, 2 theorems, 29 equations, 12 figures, 4 tables.

Key Result

Proposition 3.1

Suppose that no parameters are tied across pipeline stages, and consider the auxiliary losses defined in the paper's equation labeled eq:def_auxiliary_loss. Then, for any $i \in [K]$ and any model parameter or activation tensor $\bm{z}$ in Stage $i$, it holds that
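The displayed equation that completes this proposition is missing here, and neither the total loss nor the auxiliary losses are defined in this summary. As a hedged illustration only, assume the total loss is $\mathcal{L} = \sum_{i \in [K]} \mathcal{L}_i$, where $\mathcal{L}_i$ is the exit loss in Stage $i$, and assume auxiliary losses of the form $\mathcal{L}_K^{\mathrm{aux}} = \mathcal{L}_K$ and $\mathcal{L}_i^{\mathrm{aux}} = \mathcal{L}_i + \langle \bm{g}_i, \bm{x}_i \rangle$, with $\bm{g}_i = \partial \mathcal{L}_{i+1}^{\mathrm{aux}} / \partial \bm{x}_i$ treated as a constant. Under those assumptions, a statement of the kind described would read $\partial \mathcal{L} / \partial \bm{z} = \partial \mathcal{L}_i^{\mathrm{aux}} / \partial \bm{z}$. The sketch below checks that identity numerically on a toy two-stage scalar model; all function names and values are hypothetical.

```python
import math


def stage1(theta1: float, x0: float) -> float:
    """Stage 1 backbone: maps the input x0 to the hidden state x1."""
    return math.tanh(theta1 * x0)


def stage2(theta2: float, x1: float) -> float:
    """Stage 2 backbone: maps the hidden state x1 to x2."""
    return math.tanh(theta2 * x1)


def exit_loss(phi: float, x: float, y: float) -> float:
    """Squared-error loss of an exit layer with scalar weight phi."""
    return (phi * x - y) ** 2


def central_diff(f, v: float, eps: float = 1e-6) -> float:
    """Central finite-difference estimate of df/dv."""
    return (f(v + eps) - f(v - eps)) / (2.0 * eps)


def gradient_check(theta1=0.7, theta2=-1.3, phi1=0.5, phi2=1.1, x0=0.9, y=0.4):
    """Return (dL/dtheta1, dL1_aux/dtheta1) for the toy model."""

    # Left-hand side: gradient of the full multi-exit loss L = L1 + L2.
    def total_loss(t1: float) -> float:
        x1 = stage1(t1, x0)
        return exit_loss(phi1, x1, y) + exit_loss(phi2, stage2(theta2, x1), y)

    lhs = central_diff(total_loss, theta1)

    # g1: gradient sent back from Stage 2 with respect to x1. It is
    # evaluated once at the current x1 and then frozen as a constant.
    x1_value = stage1(theta1, x0)
    g1 = central_diff(
        lambda x1: exit_loss(phi2, stage2(theta2, x1), y), x1_value
    )

    # Right-hand side: gradient of the assumed auxiliary loss
    # L1_aux = L1 + g1 * x1, which uses only Stage 1 quantities and g1.
    def aux_loss_stage1(t1: float) -> float:
        x1 = stage1(t1, x0)
        return exit_loss(phi1, x1, y) + g1 * x1

    rhs = central_diff(aux_loss_stage1, theta1)
    return lhs, rhs
```

Calling `gradient_check()` returns two values that should agree up to finite-difference error. The point of the construction is that Stage 1 needs only the gradient tensor received from Stage 2 plus its own exit loss.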

Figures (12)

  • Figure 1: The model architecture of an early-exit LLM. Additional components compared to a standard LLM are highlighted in blue. Each $\bm{\theta}_i$ represents a sequence of Transformer layers in the backbone of the LLM, with some additional modules in $\bm{\theta}_1$ for input processing. Each $\bm{\phi}_i$ represents an early or final-exit layer that converts hidden states $\bm{x}_i$ into output $\bm{o}_i$, e.g. logits for next-token prediction.
  • Figure 2: The backpropagation process for an early-exit model partitioned into four pipeline stages.
  • Figure 3: One iteration of the 1F1B pipeline schedule, in a setting with $P=4$ pipeline stages and $M=6$ microbatches per batch. At the top of this figure, "Backbone forward/backward" stands for computation of Transformer layers on the backbone, while "Exit forward/backward" stands for computation of early-exit or final-exit layers. The number in each block denotes the index of the corresponding microbatch. Critical paths are marked by dashed red lines. From Figure (a) to (b), additional "Exit forward/backward" blocks are added, due to the introduction of early exits to middle stages. From Figure (b) to (c), the order of computation is slightly adjusted for the purpose of reducing memory usage. For clarity, we ignore computation related to the input embedding layer, and P2P communication latency between pipeline stages.
  • Figure 4: The proposed method of filling bubbles with additional microbatches. In this example, P1 and P2 go through the forward and backward passes for the first few stages, while P3 and P4 go through the full forward pass, followed by the backward pass for the last few stages.
  • Figure 5: Standard full-model inference (top) and our pipeline-based early-exit inference (bottom). Numbers in the blocks denote the tokens within one generated sequence. For simplicity of visualization, we assume here that (1) each early exit is located at the end of some pipeline stage, and (2) the latency for generating each token is the same (while in practice, generating the first token via the prefilling phase usually takes longer than generating another token during the decoding phase).
  • ...and 7 more figures
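The 1F1B schedule named in the Figure 3 caption can be sketched as a per-stage list of operations. The Python sketch below generates the standard (non-interleaved) 1F1B order for the backbone only; it does not model the extra "Exit forward/backward" blocks or the reordering the caption describes. The names `one_f_one_b` and `peak_in_flight` are hypothetical, and stages and microbatches are 0-indexed here.

```python
from typing import List, Tuple

Op = Tuple[str, int]  # ("F" or "B", microbatch index)


def one_f_one_b(stage: int, num_stages: int, num_microbatches: int) -> List[Op]:
    """Operation order on one pipeline stage under standard 1F1B."""
    # Warm-up: earlier stages run more forward passes before the first
    # backward pass can arrive from the later stages.
    warmup = min(num_stages - stage - 1, num_microbatches)
    steady = num_microbatches - warmup

    ops: List[Op] = [("F", m) for m in range(warmup)]
    # Steady state: alternate one forward and one backward pass.
    for i in range(steady):
        ops.append(("F", warmup + i))
        ops.append(("B", i))
    # Cool-down: drain the remaining backward passes.
    ops.extend(("B", steady + i) for i in range(warmup))
    return ops


def peak_in_flight(ops: List[Op]) -> int:
    """Largest number of microbatches with a forward but no backward yet."""
    live = peak = 0
    for kind, _ in ops:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak


if __name__ == "__main__":
    # Same setting as the caption: P = 4 stages, M = 6 microbatches.
    for s in range(4):
        schedule = one_f_one_b(s, num_stages=4, num_microbatches=6)
        print(s, peak_in_flight(schedule), schedule)
```

The in-flight count is a rough proxy for how many microbatches' activations a stage must hold at once, which is why the order of computation matters for memory usage.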

Theorems & Definitions (8)

  • Proposition 3.1
  • Proof of Proposition 3.1
  • Remark 1.1
  • Remark 1.2
  • Remark 1.3
  • Claim 3.1: Informal
  • Proposition 3.2
  • Proof of Proposition 3.2