Dynamic Rebatching for Efficient Early-Exit Inference with DREX

Xuting Liu, Daniel Alexander, Siva Kesava Reddy Kakarla, Behnaz Arzani, Vincent Liu

TL;DR

This paper tackles the inefficiencies of applying Early Exiting (EE) to batched LLM inference by introducing Dynamic Rebatching through the DREX system. It combines a copy-free rebatching buffer with memory-efficient KV-cache handling and a principled Adaptive Rebatching Threshold (ART) plus SLA-aware scheduling to preserve output quality while boosting throughput. Empirical results show 2-12% throughput gains over baselines, zero involuntary exits, and notable memory savings, with ART and SLA scheduling further improving responsiveness under SLA pressure. The work delivers an open-source, end-to-end solution that makes batched EE practical for production LLM serving.

Abstract

Early-Exit (EE) is a Large Language Model (LLM) architecture that accelerates inference by allowing easier tokens to be generated using only a subset of the model's layers. However, traditional batching frameworks are ill-suited for EE LLMs, as not all requests in a batch may be ready to exit at the same time. Existing solutions either force a uniform decision on the batch, which overlooks EE opportunities, or degrade output quality by forcing premature exits. We propose Dynamic Rebatching, a solution where we dynamically reorganize the batch at each early-exit point. Requests that meet the exit criteria are immediately processed, while those that continue are held in a buffer, re-grouped into a new batch, and forwarded to deeper layers. We introduce DREX, an early-exit inference system that implements Dynamic Rebatching with two key optimizations: 1) a copy-free rebatching buffer that avoids physical data movement, and 2) an EE and SLA-aware scheduler that analytically predicts whether a given rebatching operation will be profitable. DREX also efficiently handles the missing KV cache from skipped layers using memory-efficient state-copying. Our evaluation shows that DREX improves throughput by 2-12% compared to baseline approaches while maintaining output quality. Crucially, DREX completely eliminates involuntary exits, providing a key guarantee for preserving the output quality intended by the EE model.
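The rebatching step described in the abstract can be illustrated with a minimal sketch. This is not DREX's implementation: the names `RebatchBuffer` and `split_at_exit_point` are hypothetical, and the exit criterion is assumed here to be a per-request confidence score compared against a threshold, which the abstract does not specify. The sketch passes request ids around instead of moving hidden states, which is one simple way to picture a copy-free buffer. It does not model the scheduler's profitability prediction or KV-cache handling.

```python
from dataclasses import dataclass, field


@dataclass
class RebatchBuffer:
    """Hypothetical buffer of request ids waiting to run the deeper layers.

    Only ids are stored; the hidden states are assumed to stay in place.
    """
    pending: list = field(default_factory=list)

    def add(self, request_ids):
        # Queue requests that did not exit at this early-exit point.
        self.pending.extend(request_ids)

    def pop_batch(self, batch_size):
        # Re-group up to batch_size waiting requests into a new batch.
        batch = self.pending[:batch_size]
        self.pending = self.pending[batch_size:]
        return batch


def split_at_exit_point(batch_ids, confidences, threshold):
    """Split a batch into requests that exit now and requests that continue.

    Assumption: a request exits when its confidence reaches the threshold.
    """
    exiting = [r for r in batch_ids if confidences[r] >= threshold]
    continuing = [r for r in batch_ids if confidences[r] < threshold]
    return exiting, continuing


# Usage: four requests reach an early-exit point with made-up confidences.
buffer = RebatchBuffer()
confidences = {0: 0.95, 1: 0.40, 2: 0.91, 3: 0.10}
exiting, continuing = split_at_exit_point([0, 1, 2, 3], confidences, threshold=0.9)
buffer.add(continuing)
deep_batch = buffer.pop_batch(batch_size=4)
```

In this example, requests 0 and 2 exit immediately, while requests 1 and 3 are buffered and re-grouped into the batch forwarded to deeper layers. No request is forced to exit or forced to continue because of its batch neighbours.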

Paper Structure

This paper contains 21 sections, 7 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Conceptual diagram of an LLM with EE. Each token is a separate iteration. The sequence exits early in the 2nd and 4th iterations (when outputting the tokens "is" and "EOS").
  • Figure 2: The challenges of operationalizing Figure 1.
  • Figure 3: Early Exiting (EE) provides a significant throughput boost of up to 33% for non-batched inference on SOTA frameworks (EE-LLM [eellm], Apparate [apparate2024], and Miao et al. [miao2024efficient]). However, this advantage diminishes with batching, where the same EE methods offer only marginal gains of less than 2% and can sometimes even reduce throughput.
  • Figure 4: Comparison of throughput and KV cache size for Llama-EE-13B generating 4000 tokens at a batch size of 1. Both EE-LLM and DREX use state-copying. DREX employs virtual copying to reduce memory consumption. A lower early-exit (EE) threshold allows for more exits, increasing throughput but also the number of duplicated KV cache entries in EE-LLM.
  • Figure 5: System architecture. DREX executes these 5 steps to handle a split EE decision.
  • ...and 8 more figures