Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding

Sangmin Bae, Jongwoo Ko, Hwanjun Song, Se-Young Yun

TL;DR

This work targets the latency bottleneck in autoregressive language models by rethinking early-exiting with a FREE framework that combines a shallow-deep module, synchronized parallel decoding, and a Beta mixture model-based adaptive threshold. By stacking early exits in the shallow path and synchronizing with deep-model computations, FREE avoids state-copying pitfalls and exploits hardware parallelism to accelerate inference without sacrificing quality. The adaptive threshold estimator eliminates heavy per-dataset calibration, achieving substantial speedups (up to 2.16x) while preserving around 99% of full-model performance across summarization, QA, and translation tasks and demonstrating applicability to large language models with LoRA. Comprehensive ablations confirm robustness to shallow-depth choices, parallel decoding, and calibration set size, while human-like evaluation corroborates competitive quality. Overall, FREE offers a practical, scalable solution for robust, fast decoding in diverse autoregressive generation settings.
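The decoding idea summarized above (stack tokens that exit at the shallow point, then compute their deep-layer states together with the next token that needs the deep model) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `shallow_fn`, `deep_fn`, `free_style_decode`, and the toy models in the demo are hypothetical names, and prompt handling is simplified.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   shallow_fn(tokens) -> (predicted_token, confidence in [0, 1])
#   deep_fn(tokens, positions) -> one predicted token per requested position
ShallowFn = Callable[[List[int]], Tuple[int, float]]
DeepFn = Callable[[List[int], List[int]], List[int]]


def free_style_decode(
    shallow_fn: ShallowFn,
    deep_fn: DeepFn,
    prompt: List[int],
    max_new_tokens: int,
    threshold: float,
    eos_id: int,
) -> Tuple[List[int], int]:
    """Toy shallow-deep decoding with a stack of early-exited positions."""
    tokens = list(prompt)
    pending: List[int] = []  # positions whose deep-layer states were skipped
    deep_calls = 0
    for _ in range(max_new_tokens):
        pos = len(tokens) - 1  # position of the token currently being processed
        next_tok, conf = shallow_fn(tokens)
        if conf >= threshold:
            # Early exit: keep the shallow prediction, defer the deep computation.
            pending.append(pos)
        else:
            # One deep call covers the current position and every stacked one,
            # so no hidden state has to be copied from the shallow exit.
            deep_preds = deep_fn(tokens, pending + [pos])
            deep_calls += 1
            next_tok = deep_preds[-1]  # deep prediction for the current position
            pending.clear()
        tokens.append(next_tok)
        if next_tok == eos_id:
            break
    return tokens[len(prompt):], deep_calls


# Toy models: both predict "previous token + 1"; the shallow one is
# unconfident at every third step, which forces a deep call.
def toy_shallow(toks: List[int]) -> Tuple[int, float]:
    return toks[-1] + 1, (0.95 if len(toks) % 3 else 0.2)


def toy_deep(toks: List[int], positions: List[int]) -> List[int]:
    return [toks[p] + 1 for p in positions]


generated, deep_calls = free_style_decode(
    toy_shallow, toy_deep, prompt=[0], max_new_tokens=6, threshold=0.9, eos_id=99
)
print(generated, deep_calls)
```

In this toy run six tokens are generated with only two deep calls, each of which also processes the two positions that exited early before it. In a real model the batched deep call would operate on hidden states and a key-value cache rather than on token lists.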

Abstract

To tackle the high inference latency exhibited by autoregressive language models, previous studies have proposed an early-exiting framework that allocates adaptive computation paths for each token based on the complexity of generating the subsequent token. However, we observed several shortcomings, including performance degradation caused by a state copying mechanism or numerous exit paths, and sensitivity to exit confidence thresholds. Consequently, we propose a Fast and Robust Early-Exiting (FREE) framework, which incorporates a shallow-deep module and synchronized parallel decoding. Our framework enables faster inference by synchronizing the decoding process of the current token with previously stacked early-exited tokens. Furthermore, as parallel decoding allows us to observe predictions from both shallow and deep models, we present a novel adaptive threshold estimator that exploits a Beta mixture model to determine suitable confidence thresholds. We empirically demonstrated the superiority of our proposed framework on extensive generation tasks.
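To make the threshold idea concrete, here is a rough sketch of how a two-component Beta mixture over shallow-model confidence scores could yield an exit threshold. The abstract does not spell out the estimator, so everything specific here is an assumption: the components are labeled by whether the shallow and deep predictions agree, each Beta is fit by method of moments, and the threshold is the smallest confidence whose posterior probability of agreement reaches a hypothetical `posterior_level`.

```python
import math
from typing import List, Sequence, Tuple


def fit_beta_moments(xs: Sequence[float]) -> Tuple[float, float]:
    """Method-of-moments estimate of Beta(a, b) parameters (an assumed fit)."""
    n = len(xs)
    mean = sum(xs) / n
    var = max(sum((x - mean) ** 2 for x in xs) / n, 1e-6)  # avoid zero variance
    common = max(mean * (1.0 - mean) / var - 1.0, 1e-3)    # keep parameters positive
    return mean * common, (1.0 - mean) * common


def beta_pdf(x: float, a: float, b: float) -> float:
    """Beta density, computed in log space for numerical stability."""
    x = min(max(x, 1e-6), 1.0 - 1e-6)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp((a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x) + log_norm)


def estimate_threshold(
    confidences: Sequence[float],
    agreements: Sequence[bool],
    posterior_level: float = 0.9,  # assumed hyperparameter
    grid: int = 1000,
) -> float:
    """Smallest confidence whose posterior P(agree | confidence) >= posterior_level."""
    agree = [c for c, ok in zip(confidences, agreements) if ok]
    disagree = [c for c, ok in zip(confidences, agreements) if not ok]
    if not agree or not disagree:
        raise ValueError("need both agreeing and disagreeing samples")
    prior_agree = len(agree) / len(confidences)
    a1, b1 = fit_beta_moments(agree)
    a0, b0 = fit_beta_moments(disagree)
    for i in range(1, grid):
        c = i / grid
        p1 = prior_agree * beta_pdf(c, a1, b1)
        p0 = (1.0 - prior_agree) * beta_pdf(c, a0, b0)
        if p1 + p0 > 0.0 and p1 / (p1 + p0) >= posterior_level:
            return c
    return 1.0  # never confident enough: always use the deep model


# Made-up confidence scores, for illustration only.
confs: List[float] = [0.85, 0.90, 0.92, 0.95, 0.97, 0.99, 0.30, 0.40, 0.45, 0.50, 0.55, 0.60]
agrees: List[bool] = [True] * 6 + [False] * 6
threshold = estimate_threshold(confs, agrees)
print(threshold)
```

Because synchronized parallel decoding exposes both shallow and deep predictions for the same tokens, agreement labels of this kind come for free during inference, which is what would let such an estimator be refreshed without a separate calibration pass.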

Paper Structure (38 sections, 9 equations, 10 figures, 10 tables, 1 algorithm)

Figures (10)

  • Figure 1: Overview of our FREE framework compared to the conventional early-exiting framework. FREE exhibits three key differences: (1) FREE employs a shallow-deep module that utilizes two exit points instead of employing all layers as exit points, (2) FREE replaces the state copying mechanism (yellow colored) with synchronized parallel decoding (red colored) to prevent performance degradation while accelerating inference speed, and (3) FREE utilizes an adaptive threshold estimator to determine the appropriate threshold values for each dataset during inference.
  • Figure 2: Illustration of the ROUGE-L scores and generated sequence length from the static-exiting approach in T5-small (left) and T5-large (right) on the SAMSum dataset. The horizontal dashed line represents the average sequence length of the ground truth.
  • Figure 3: Component-wise computational cost on three datasets. Four bars correspond to full model and early-exiting with thresholds of 0.9, 0.7, and 0.5. The hatched color denotes the elapsed time after the token exits, related to the state copying mechanism. The numbers above the bars represent the ROUGE-L scores. SA and CA denote self- and cross-attention, respectively.
  • Figure 4: Overview of synchronized parallel decoding. We colored the tokens used to generate the next token based on the model that they forward.
  • Figure 5: The trade-off between the generated output quality and normalized latency under different exit conditions. We varied the exit threshold values between 0 and 1 for both CALM and FREE$^{\dagger}$ and the number of exit layers for the static-exiting framework. We exclude the inner point of the Pareto curve, and the dashed line represents the ROUGE-L score of the full model, which is the fine-tuned shallow-deep module.
  • ...and 5 more figures