Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding
Sangmin Bae, Jongwoo Ko, Hwanjun Song, Se-Young Yun
TL;DR
This work targets the latency bottleneck in autoregressive language models by rethinking early exiting with FREE, a framework that combines a shallow-deep module, synchronized parallel decoding, and an adaptive threshold estimator based on a Beta mixture model. By stacking early-exited tokens and decoding them in the deep model together with the current token, FREE avoids the pitfalls of state copying and exploits hardware parallelism to accelerate inference with little loss in quality. The adaptive threshold estimator removes the need for heavy per-dataset calibration. FREE achieves speedups of up to 2.16x while preserving around 99% of full-model performance across summarization, QA, and translation tasks, and it also applies to large language models fine-tuned with LoRA. Ablations indicate robustness to the choice of shallow depth, the use of parallel decoding, and calibration set size, while human-like evaluation corroborates competitive quality. Overall, FREE offers a practical, scalable approach to fast and robust decoding in diverse autoregressive generation settings.
Abstract
To tackle the high inference latency of autoregressive language models, previous studies have proposed an early-exiting framework that allocates an adaptive computation path to each token based on the complexity of generating the subsequent token. However, we observe several shortcomings, including performance degradation caused by the state copying mechanism or by numerous exit paths, and sensitivity to exit confidence thresholds. Consequently, we propose a Fast and Robust Early-Exiting (FREE) framework, which incorporates a shallow-deep module and synchronized parallel decoding. Our framework enables faster inference by synchronizing the decoding process of the current token with previously stacked early-exited tokens. Furthermore, as parallel decoding allows us to observe predictions from both the shallow and deep models, we present a novel adaptive threshold estimator that exploits a Beta mixture model to determine suitable confidence thresholds. We empirically demonstrate the superiority of the proposed framework on a wide range of generation tasks.
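The control flow of synchronized parallel decoding described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `free_decode_sketch`, `toy_shallow`, and `toy_deep` are hypothetical, the two "models" are stand-in functions over integers, and the confidence threshold is a fixed constant rather than one estimated by the Beta mixture model. The sketch only shows how early-exited positions are stacked and then processed together with the current token in a single deep pass.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures for the stand-in models (assumptions, not the paper's API).
ShallowStep = Callable[[List[int]], Tuple[int, float]]  # -> (token, confidence)
DeepStep = Callable[[List[int], List[int]], int]        # (tokens, positions) -> token


def free_decode_sketch(
    shallow_step: ShallowStep,
    deep_step: DeepStep,
    prompt: List[int],
    num_new_tokens: int,
    threshold: float,
) -> Tuple[List[int], List[int]]:
    """Toy shallow-deep decoding loop with stacked early-exited positions."""
    tokens = list(prompt)
    pending: List[int] = []      # early-exited positions lacking deep-layer states
    batch_sizes: List[int] = []  # number of positions handled by each deep pass

    for _ in range(num_new_tokens):
        pos = len(tokens)
        token, confidence = shallow_step(tokens)
        if confidence >= threshold:
            # Early exit: accept the shallow prediction and defer deep computation.
            pending.append(pos)
        else:
            # Not confident: one deep pass covers the current position and every
            # stacked position, instead of copying shallow states upward.
            positions = pending + [pos]
            token = deep_step(tokens, positions)
            batch_sizes.append(len(positions))
            pending.clear()
        tokens.append(token)
    return tokens, batch_sizes


def toy_shallow(tokens: List[int]) -> Tuple[int, float]:
    # Stand-in shallow model: predicts last + 1, unsure on multiples of 4.
    nxt = tokens[-1] + 1
    return nxt, (0.5 if nxt % 4 == 0 else 0.95)


def toy_deep(tokens: List[int], positions: List[int]) -> int:
    # Stand-in deep model: always predicts last + 1.
    return tokens[-1] + 1


if __name__ == "__main__":
    out, sizes = free_decode_sketch(toy_shallow, toy_deep, [0], 8, threshold=0.9)
    print(out, sizes)
```

In this toy run, three tokens exit early before each low-confidence token, so each deep pass processes four positions at once. In a real model those positions would be computed in parallel on the accelerator, which is where the latency benefit is expected to come from.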
