Accelerating Large Language Model Inference with Self-Supervised Early Exits
Florian Valade
TL;DR
This work tackles the high computational cost of large language model inference by adding modular early exit heads at intermediate transformer layers. Each head is trained with a self-supervised objective to mimic the full model's predictions, and a calibrated entropy-based confidence threshold ε decides when a head's prediction is accepted so that computation can stop early. The authors extend the approach to Dynamic Self-Speculative Decoding (DSSD), which uses a single threshold for both head selection and speculative drafting and achieves 1.66× higher token acceptance than manually tuned LayerSkip baselines with minimal hyperparameter tuning. Across the Pythia suite, from 70M to 2.8B parameters, the method reduces inference cost (with notably larger speedups for bigger models) while maintaining accuracy on multiple benchmarks, and it does so without retraining the backbone. This points to practical potential for efficient, privacy-conscious, device-friendly LLM deployment.
Abstract
This paper presents a modular approach to accelerating inference in large language models (LLMs) by adding early exit heads at intermediate transformer layers. Each head is trained in a self-supervised manner to mimic the main model's predictions, allowing computation to stop early when a calibrated confidence threshold is reached. We evaluate several confidence metrics and show that entropy provides the most reliable separation between correct and incorrect predictions. Experiments on the Pythia model suite (70M to 2.8B parameters) demonstrate that our method significantly reduces inference cost while maintaining accuracy across multiple benchmarks. We further adapt this approach to speculative decoding, introducing Dynamic Self-Speculative Decoding (DSSD), which achieves 1.66× higher token acceptance than manually tuned LayerSkip baselines with minimal hyperparameter tuning.
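The entropy-based exit rule described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`softmax`, `entropy`, `early_exit`), the toy logits, and the threshold values are hypothetical, and the sketch assumes a head's prediction is accepted when the Shannon entropy of its softmax distribution falls below a threshold `epsilon`. For simplicity, all head logits are passed in precomputed; a real system would evaluate heads layer by layer and skip the remaining layers once a head is accepted.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; lower values indicate higher confidence."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def early_exit(
    head_logits: Sequence[Sequence[float]],
    final_logits: Sequence[float],
    epsilon: float,
) -> Tuple[int, int]:
    """Return (exit_index, token_id).

    exit_index is the index of the first head whose entropy is below
    epsilon, or len(head_logits) if no head is confident enough and the
    full model's output is used instead. The acceptance rule
    (entropy < epsilon) is an assumption made for illustration.
    """
    for i, logits in enumerate(head_logits):
        probs = softmax(logits)
        if entropy(probs) < epsilon:
            token = max(range(len(probs)), key=probs.__getitem__)
            return i, token
    probs = softmax(final_logits)
    token = max(range(len(probs)), key=probs.__getitem__)
    return len(head_logits), token


# Toy example: the first head is nearly uniform (uncertain),
# the second head is sharply peaked on token 0 (confident).
heads = [[0.1, 0.0, -0.1], [8.0, 0.0, 0.0]]
final = [9.0, 0.0, 0.0]

loose = early_exit(heads, final, epsilon=0.1)     # second head is accepted
strict = early_exit(heads, final, epsilon=0.001)  # no head is accepted
```

With the looser threshold the second head's prediction is accepted and the remaining computation could be skipped; with the stricter one, every head is rejected and the full model's output is used. This reflects the trade-off the threshold controls: a larger `epsilon` exits earlier at the risk of diverging from the full model.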
