EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism
Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou
TL;DR
EE-LLM tackles the high cost and latency of training and deploying large language models by enabling large-scale training and inference of early-exit LLMs with 3D parallelism, built on top of Megatron-LM. It introduces a lightweight backpropagation mechanism for the multi-exit training objective across pipeline stages, along with strategies that use idle resources and pipeline bubbles to minimize training overhead. For inference, EE-LLM provides two approaches that are compatible with KV caching (KV recomputation and pipeline-based parallel decoding) and realize token-wise adaptive exits without sacrificing autoregressive generation quality. Empirically, EE-LLM trains with negligible overhead relative to standard LLM training and delivers substantial inference speedups (often 2x) for early-exit models of up to 30B parameters on multi-node GPU clusters, while maintaining comparable evaluation performance. The code is publicly released, and the results suggest that early exiting can be a practical option at scale, with potential applicability to broader architectures and hardware.
Abstract
We present EE-LLM, a framework for large-scale training and inference of early-exit large language models (LLMs). While recent works have shown preliminary evidence for the efficacy of early exiting in accelerating LLM inference, EE-LLM takes a foundational step towards scaling up early-exit LLMs by supporting their training and inference with massive 3D parallelism. Built upon Megatron-LM, EE-LLM implements a variety of algorithmic innovations and performance optimizations tailored to early exiting, including a lightweight method that facilitates backpropagation for the early-exit training objective with pipeline parallelism, techniques for leveraging idle resources in the original pipeline schedule for computation related to early-exit layers, and two approaches to early-exit inference that are compatible with KV caching for autoregressive generation. Our analytical and empirical study shows that EE-LLM achieves great training efficiency with negligible computational overhead compared to standard LLM training, as well as outstanding inference speedup without compromising output quality. To facilitate further research and adoption, we release EE-LLM at https://github.com/pan-x-c/EE-LLM.
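To make the two central ideas concrete, the following is a minimal, framework-free sketch of an early-exit training objective and a token-wise exit decision. It is not EE-LLM's implementation: it assumes the training objective is a weighted sum of per-exit cross-entropy losses and that an exit is taken once the top softmax probability reaches a threshold. The function names (`multi_exit_loss`, `first_confident_exit`), the weights, and the threshold are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-likelihood of the target token under softmax(logits)."""
    return -math.log(softmax(logits)[target])


def multi_exit_loss(exit_logits, target, weights):
    """Assumed objective: weighted sum of the losses at every exit.

    exit_logits[i] holds the logits produced by the i-th exit (the last
    entry plays the role of the final output layer); weights[i] is the
    hypothetical weight given to that exit's loss.
    """
    if len(exit_logits) != len(weights):
        raise ValueError("need exactly one weight per exit")
    return sum(w * cross_entropy(l, target) for l, w in zip(exit_logits, weights))


def first_confident_exit(exit_logits, threshold):
    """Assumed exit rule: stop at the first exit whose top probability
    reaches the threshold; fall back to the final exit otherwise."""
    for i, logits in enumerate(exit_logits):
        if max(softmax(logits)) >= threshold:
            return i
    return len(exit_logits) - 1


if __name__ == "__main__":
    # Two exits over a two-token vocabulary; the second exit is more confident.
    logits_per_exit = [[0.0, 0.0], [2.0, 0.0]]
    print(multi_exit_loss(logits_per_exit, target=0, weights=[0.5, 1.0]))
    print(first_confident_exit(logits_per_exit, threshold=0.8))
```

In a real system the per-exit logits would come from output heads attached to intermediate Transformer layers, and the exit decision would be made once per generated token. This is where compatibility with KV caching becomes a concern, since layers skipped for one token leave no cached keys and values for later tokens.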
