HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving

Avinash Kumar, Shashank Nag, Jason Clemons, Lizy John, Poulami Das

TL;DR

HELIOS tackles the bottlenecks of single-model EE-LLM serving by orchestrating multiple models and greedily loading only the layers most likely to be used, guided by real-time profiling of early-exit distributions. By exploiting complementary exit patterns across models and the tendency for low-confidence tokens to remain unchanged after additional layers, HELIOS achieves higher throughput and larger batch sizes while preserving accuracy. The framework uses a model repository, real-time evaluation, and a confidence-breach-based adaptation strategy to minimize overheads from model switching and partial loading. Empirical results show substantial gains in throughput (about 1.48–2.13x), memory savings (up to ~67%), and batch size (up to ~15x) across diverse tasks and SLOs, with negligible accuracy impact. These findings suggest HELIOS as a practical, scalable approach to efficient LLM inference serving in production settings.

Abstract

Early-Exit Large Language Models (EE-LLMs) enable high-throughput inference by allowing tokens to exit early at intermediate layers. However, their throughput is limited by how much computation and memory they actually save. Existing EE-LLM frameworks rely on a single model, so their token generation latencies are bottlenecked by tokens that do not exit early and traverse additional layers. Moreover, early exits are only known at runtime and depend on the request. These frameworks therefore load the weights of all model layers, even though large portions remain unused when tokens exit early. The lack of memory savings limits how far batch sizes can be scaled. We propose $\textit{HELIOS}$, a framework that improves both token generation latency and batch sizes to enable high throughput in EE-LLMs. HELIOS exploits two insights. $\textit{First}$, early exits are often complementary across models: tokens that do not exit early on one model often take an early exit on another. HELIOS employs multiple models and dynamically switches between them to collectively maximize the number of tokens that exit early and minimize token generation latencies. $\textit{Second}$, even when a predicted token does not exit early due to low confidence, it often remains unchanged after traversing additional layers. HELIOS greedily allows such tokens to exit early and loads only the weights of the layers most likely to be used, yielding memory savings that are then repurposed to increase batch sizes. HELIOS employs real-time profiling to accurately identify the early-exit distributions, and adaptively switches between models by tracking tokens in real time to minimize the performance degradation caused by greedy model loading and exiting. Our evaluations show that HELIOS achieves $1.48\times$ higher throughput and $15.14\times$ larger batch size compared to existing EE-LLM frameworks.
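
The second insight, loading only the layers most likely to be used, can be illustrated with a small sketch. It is not the HELIOS implementation: the function name `pick_layers_to_load`, the `coverage` threshold, and the exit counts in `profile` are all hypothetical, chosen only to show how a profiled early-exit distribution could be turned into a loading decision.

```python
from typing import Dict, Tuple


def pick_layers_to_load(
    exit_counts: Dict[int, int],
    total_layers: int,
    coverage: float = 0.9,
) -> Tuple[int, float]:
    """Pick the shallowest exit layer that covers `coverage` of profiled tokens.

    exit_counts maps an exit layer index to the number of profiled tokens
    that exited there. Returns (deepest_layer_to_load, fraction_of_layers_skipped).
    Both the coverage rule and the return shape are assumptions for illustration.
    """
    total = sum(exit_counts.values())
    if total == 0:
        # No profile yet: fall back to loading the full model.
        return total_layers, 0.0

    seen = 0
    for layer in sorted(exit_counts):
        seen += exit_counts[layer]
        if seen / total >= coverage:
            return layer, 1.0 - layer / total_layers

    return total_layers, 0.0


# Invented profile for a 32-layer model with exits at layers 9, 16, 24, and 32.
profile = {9: 620, 16: 250, 24: 90, 32: 40}
cutoff, skipped = pick_layers_to_load(profile, total_layers=32, coverage=0.85)
print(cutoff, skipped)  # 16 0.5
```

With this invented profile, 87% of tokens exit at or before layer 16, so only the first 16 of 32 layers would be loaded. Tokens that would have gone deeper are forced out at layer 16, which is why the abstract pairs greedy loading with real-time tracking.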

Paper Structure

This paper contains 37 sections, 15 figures, 7 tables, 1 algorithm.

Figures (15)

  • Figure 1: (a) Memory required to store model weights and (b) throughput of CodeLlama-34B and Llama2-70B models for vanilla auto-regressive decoding, EE-LLM, and HELIOS on ShareGPT. By using multiple LLMs to maximize early exits across models and greedily loading the weights of the layers most likely to be used, HELIOS reduces both token generation latencies and memory footprint. The memory savings allow larger batch sizes; overall, HELIOS improves throughput by 45% relative to vanilla decoding, whereas EE-LLMs improve it by only 16%.
  • Figure 2: (a) Current EE-LLMs select a model (say $M_1$), load weights of all its layers, and use a batch size of $1$ to avoid synchronization across tokens. (b) HELIOS uses multiple LLMs ($M_1$ and $M_2$ here) and only loads the weights of the layers most likely to be used based on real-time early exit profiles. HELIOS improves batch sizes by increasing available memory capacity and reducing synchronization overheads. HELIOS also monitors performance in real-time and switches between LLMs or loads additional layers of the current model to prevent accuracy degradation.
  • Figure 3: Exit layers for serving a typical workload with a mixture of prompts using OPT 1.3B and 6.7B models.
  • Figure 4: Design of HELIOS
  • Figure 5: (a) Fraction of unchanged tokens for four datasets on the OPT-6.7B model from the first exit layer (9) to the final layer (32). The probability of the predicted token staying unchanged is always greater than 85%. (b) Fraction of tokens generated at an exit layer that remain unchanged even after traversing the full model, for the CodeLlama-34B model across six datasets.
  • ...and 10 more figures
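
The Figure 2 caption says that HELIOS monitors performance in real time and either switches between LLMs or loads additional layers to prevent accuracy degradation. The following is a minimal sketch of such a monitor, assuming a sliding-window rule over token confidences. The class name `BreachMonitor`, its parameters, and the thresholds are hypothetical and are not taken from the paper.

```python
from collections import deque


class BreachMonitor:
    """Track how often tokens leave the deepest loaded layer with low confidence."""

    def __init__(
        self,
        window: int = 100,
        conf_threshold: float = 0.8,
        max_breach_rate: float = 0.2,
    ) -> None:
        self.conf_threshold = conf_threshold
        self.max_breach_rate = max_breach_rate
        # Keeps only the most recent `window` breach flags.
        self.recent = deque(maxlen=window)

    def record(self, confidence: float) -> bool:
        """Record one token's confidence; return True if adaptation should trigger.

        Adaptation here stands for whatever the serving system does next
        (for example, switching models or loading more layers).
        """
        self.recent.append(confidence < self.conf_threshold)
        window_full = len(self.recent) == self.recent.maxlen
        breach_rate = sum(self.recent) / len(self.recent)
        return window_full and breach_rate > self.max_breach_rate


# Illustrative confidences only; a tiny window keeps the trace easy to follow.
monitor = BreachMonitor(window=4, conf_threshold=0.8, max_breach_rate=0.5)
signals = [monitor.record(c) for c in [0.9, 0.5, 0.4, 0.3, 0.95]]
print(signals)  # [False, False, False, True, True]
```

Waiting for a full window before signalling avoids reacting to the first few tokens of a request. A real system would also need a policy for choosing between switching models and loading more layers, which this sketch leaves out.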