HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
Avinash Kumar, Shashank Nag, Jason Clemons, Lizy John, Poulami Das
TL;DR
HELIOS tackles the bottlenecks of single-model EE-LLM serving by orchestrating multiple models and greedily loading only the layers most likely to be used, guided by real-time profiling of early-exit distributions. By exploiting complementary exit patterns across models and the tendency for low-confidence tokens to remain unchanged after additional layers, HELIOS achieves higher throughput and larger batch sizes while preserving accuracy. The framework uses a model repository, real-time evaluation, and a confidence-breach-based adaptation strategy to minimize overheads from model switching and partial loading. Empirical results show substantial gains in throughput (about 1.48–2.13x), memory savings (up to ~67%), and batch size (up to ~15x) across diverse tasks and SLOs, with negligible accuracy impact. These findings suggest HELIOS as a practical, scalable approach to efficient LLM inference serving in production settings.
Abstract
Early-Exit Large Language Models (EE-LLMs) enable high-throughput inference by allowing tokens to exit early at intermediate layers. However, their throughput is limited by the computational and memory savings they actually achieve. Existing EE-LLM frameworks rely on a single model, and therefore their token generation latencies are bottlenecked by tokens that do not exit early and traverse additional layers. Moreover, early exits are only known at runtime and depend on the request. Therefore, these frameworks load the weights of all model layers even though large portions remain unused when tokens exit early. This lack of memory savings limits how far batch sizes can scale. We propose $\textit{HELIOS}$, a framework that improves both token generation latency and batch sizes to enable high throughput in EE-LLMs. HELIOS exploits two insights. $\textit{First}$, early exits are often complementary across models: tokens that do not exit early on one model often take an early exit on another. HELIOS employs multiple models and dynamically switches between them to collectively maximize the number of tokens that exit early and minimize token generation latencies. $\textit{Second}$, even when a predicted token does not exit early due to poor confidence, it often remains unchanged after traversing additional layers. HELIOS greedily allows such tokens to exit early and loads only the weights of the layers most likely to be used, yielding memory savings that are then repurposed to increase batch sizes. HELIOS employs real-time profiling to accurately identify the early-exit distributions, and adaptively switches between models by tracking tokens in real time to minimize the performance degradation caused by greedy model loading and exiting. Our evaluations show that HELIOS achieves $1.48\times$ higher throughput and $15.14\times$ larger batch size compared to existing EE-LLM frameworks.
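To make the greedy layer-loading idea concrete, the following is a minimal sketch and not the HELIOS implementation. It assumes a profiled early-exit distribution is available as a mapping from exit layer to token count, and that a hypothetical `coverage` target decides how many layers to keep resident. The function names, the `coverage` parameter, and the example profiles are all illustrative assumptions. Comparing models by layer depth alone is a simplification, since layers of different models need not occupy the same amount of memory.

```python
def layers_to_load(exit_counts, coverage=0.9):
    """Return the shallowest exit layer whose cumulative share of
    profiled token exits reaches `coverage` (hypothetical policy)."""
    total = sum(exit_counts.values())
    if total == 0:
        raise ValueError("exit profile is empty")
    covered = 0
    for layer in sorted(exit_counts):
        covered += exit_counts[layer]
        if covered / total >= coverage:
            return layer
    # Reached only if coverage > 1.0: fall back to the deepest exit.
    return max(exit_counts)


def pick_model(profiles, coverage=0.9):
    """Pick the model that needs the fewest loaded layers to reach
    the coverage target. Returns (model_name, layer_depth)."""
    candidates = (
        (name, layers_to_load(counts, coverage))
        for name, counts in profiles.items()
    )
    return min(candidates, key=lambda item: item[1])


# Hypothetical profiles: exit layer -> number of tokens that exited there.
example_profiles = {
    "model_a": {8: 50, 16: 30, 32: 20},
    "model_b": {8: 20, 16: 20, 32: 60},
}
choice = pick_model(example_profiles, coverage=0.75)
```

In this sketch, a model whose tokens mostly exit at shallow layers is preferred because fewer layers must stay loaded, which is the kind of memory saving the abstract describes being repurposed for larger batches. A real system would also need the adaptation step described above, re-profiling and switching models when too many tokens fail to exit within the loaded layers.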
