Accelerating Large Language Model Inference with Self-Supervised Early Exits

Florian Valade

TL;DR

This work tackles the high computational cost of large language model inference by introducing modular early exit heads placed at intermediate transformer layers. Each head is trained via a self-supervised objective to mimic the full model, and a calibrated entropy-based confidence threshold ε governs when a head’s prediction is accepted, so computation can stop early once the head is confident enough. The authors extend the approach to Dynamic Self-Speculative Decoding (DSSD), which unifies head selection and speculative drafting under a single threshold and achieves 1.66× higher token acceptance than manually tuned LayerSkip baselines on the Pythia suite, with minimal hyperparameter tuning. Across model sizes from 70M to 2.8B parameters, the method reduces inference cost (with larger speedups for bigger models) while preserving accuracy on multiple benchmarks, which points to practical potential for efficient, privacy-conscious, device-friendly LLM deployment. Because only the added heads are trained, the backbone does not need to be retrained.

Abstract

This paper presents a modular approach to accelerate inference in large language models (LLMs) by adding early exit heads at intermediate transformer layers. Each head is trained in a self-supervised manner to mimic the main model's predictions, allowing computation to stop early when a calibrated confidence threshold is reached. We evaluate several confidence metrics and show that entropy provides the most reliable separation between correct and incorrect predictions. Experiments on the Pythia model suite (70M to 2.8B parameters) demonstrate that our method significantly reduces inference cost while maintaining accuracy across multiple benchmarks. We further adapt this approach to speculative decoding, introducing Dynamic Self-Speculative Decoding (DSSD), which achieves 1.66× higher token acceptance than manually tuned LayerSkip baselines with minimal hyperparameter tuning.
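
The entropy-gated exit rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names and the `max_entropy` parameter are hypothetical, and the calibration step that maps the threshold ε to an entropy cutoff is not described here, so the cutoff is passed in directly. For simplicity the logits of every head are precomputed; in a real model each head would be evaluated only when its layer is reached, which is where the savings come from.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a vector of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; low entropy means a confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def early_exit_predict(
    head_logits: Sequence[Sequence[float]],
    final_logits: Sequence[float],
    max_entropy: float,
) -> Tuple[int, int]:
    """Return (token_id, exit_index).

    head_logits holds one logits vector per early exit head, in layer order.
    exit_index is the index of the head that was accepted, or -1 when no
    head was confident enough and the full model's output is used.
    max_entropy is an assumed, already calibrated entropy cutoff.
    """
    for i, logits in enumerate(head_logits):
        probs = softmax(logits)
        if entropy(probs) <= max_entropy:
            # Head i is confident enough: accept its argmax and stop.
            return max(range(len(probs)), key=probs.__getitem__), i
    # No head passed the check: fall back to the full model.
    probs = softmax(final_logits)
    return max(range(len(probs)), key=probs.__getitem__), -1


if __name__ == "__main__":
    heads = [[0.1, 0.0, 0.05], [5.0, 0.0, 0.0]]  # toy 3-token vocabulary
    final = [6.0, 0.0, 0.0]
    print(early_exit_predict(heads, final, max_entropy=0.5))
    print(early_exit_predict(heads, final, max_entropy=0.01))
```

A looser cutoff lets earlier heads answer more often and saves more computation, while a stricter one defers more tokens to the full model.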

Paper Structure (34 sections, 13 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: ROC curves for three confidence metrics across six Pythia models (70M to 2.8B). Entropy consistently achieves the highest AUC.
  • Figure 2: Speedup vs. confidence threshold $\epsilon$ for different Pythia model sizes.
  • Figure 3: Adaptive head selection: token distribution across heads H0(L6), H1(L12), H2(L18), H3(L24) at different accuracy thresholds $\epsilon$, where notation $H_i(L_j)$ indicates head $i$ at layer $j$. As $\epsilon$ decreases, the system shifts from primarily using H0 to a balanced mix across all heads.
  • Figure 4: Acceptance rate vs. mean exit layer during drafting across all configurations. The dashed green line traces DSSD's frontier (diamonds), while LayerSkip configurations (circles) are dominated. The star highlights DSSD at $\epsilon=0.9$, achieving 88% acceptance with a mean exit layer of only 13.6.
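
Figure 1 compares confidence metrics by the area under their ROC curves. As a rough illustration of that comparison, the sketch below computes the AUC of a confidence score from pairwise comparisons. It rests on assumptions not stated in this summary: a prediction counts as "correct" when the head agrees with the full model, and entropy is negated so that a higher score means more confidence. The function name `roc_auc` and the toy data are hypothetical.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """AUC as the probability that a correct prediction outscores an
    incorrect one, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data: entropy of a head's distribution for five tokens, and
    # whether the head's prediction matched the full model (assumed label).
    entropies = [0.1, 0.3, 1.2, 0.9, 0.2]
    correct = [True, True, False, False, True]
    confidence = [-h for h in entropies]  # lower entropy = higher confidence
    print(roc_auc(confidence, correct))
```

An AUC near 1 means the metric ranks almost every correct prediction above every incorrect one, and 0.5 means it does no better than chance.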