Early Exit Is a Natural Capability in Transformer-based Models: An Empirical Study on Early Exit without Joint Optimization

Weiqiao Shan, Long Meng, Tong Zheng, Yingfeng Luo, Bei Li, Junxin Wang, Tong Xiao, Jingbo Zhu

TL;DR

This paper challenges the necessity of extra output layers and joint optimization for early exit (EE) in transformer-based large language models. It demonstrates that EE is a natural capability of transformer architectures and can occur without additional parameters, though joint optimization improves gating accuracy by aligning inter-layer output distributions. The study extends EE observations across encoder-only, encoder-decoder, and decoder models, and analyzes sub-word and sub-layer patterns in LLaMA to elucidate exit behavior. It also investigates token-level EE with KV-copy, highlighting challenges in long-sequence generation and offering insights into skip connections and potential alternative gating signals. The findings have practical implications for speeding up decoding in LLMs and provide a nuanced view of when and how EE can be leveraged effectively.

Abstract

Large language models (LLMs) exhibit exceptional performance across various downstream tasks. However, their inference is slow because of their large parameter counts. Early exit (EE) is an approach that aims to accelerate auto-regressive decoding: it generates outputs from intermediate layers instead of running the whole model, which offers a promising solution to this challenge. However, the additional output layers and joint optimization used in conventional EE hinder its application to LLMs. In this paper, we explore the possibility of EE in LLMs without additional output layers or joint optimization. Our findings indicate that EE is a natural capability of transformer-based models. While joint optimization does not give a model its EE capability, it is still needed to improve the accuracy with which gating functions locate the optimal EE layer. Additionally, our study reveals patterns in EE behavior from a sub-word perspective based on the LLaMA model, as well as the potential for EE based on sub-layers.
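
To make the idea concrete, here is a minimal sketch of early exit without extra output layers, assuming the model's existing output projection is simply reused on every layer's hidden state. It is not the authors' implementation. The function names, the toy weights, and the max-softmax-probability gate are all illustrative assumptions. A real model would typically also apply its final normalization before the projection.

```python
import math

def project(hidden, output_weights):
    # Reuse the single shared output layer: logits[v] = <hidden, output_weights[v]>.
    return [sum(h * w for h, w in zip(hidden, row)) for row in output_weights]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def optimal_exit_layer(layer_hiddens, output_weights):
    # Earliest layer (0-indexed) whose top-1 token equals the final layer's top-1.
    preds = [argmax(project(h, output_weights)) for h in layer_hiddens]
    final = preds[-1]
    return next(i for i, p in enumerate(preds) if p == final)

def gated_exit_layer(layer_hiddens, output_weights, threshold):
    # Hypothetical gate: exit at the first layer whose top-1 probability
    # reaches the threshold; fall back to the last layer otherwise.
    for i, h in enumerate(layer_hiddens):
        if max(softmax(project(h, output_weights))) >= threshold:
            return i
    return len(layer_hiddens) - 1

# Toy data (made up): 3 layers, hidden size 2, vocabulary of 3 tokens.
output_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
layer_hiddens = [[0.2, 0.1], [0.1, 0.5], [0.0, 3.0]]

best = optimal_exit_layer(layer_hiddens, output_weights)
strict = gated_exit_layer(layer_hiddens, output_weights, threshold=0.9)
loose = gated_exit_layer(layer_hiddens, output_weights, threshold=0.45)
print(best, strict, loose)
```

In this toy case the intermediate layer already predicts the final token, but a strict confidence threshold only fires at the last layer. That gap between the optimal exit layer and the gated exit layer is the kind of gating inaccuracy the abstract attributes to the absence of joint optimization.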

Paper Structure

This paper contains 51 sections, 3 equations, 11 figures, 13 tables.

Figures (11)

  • Figure 1: Percentage of tokens at each layer across the entire test set where a token first matches the final output during inference.
  • Figure 2: An example from Llama-2-Chat-13B on the WMT22 ZH2EN translation test set: the Top-1 hypothesis already matches the final Top-1 output at layers far from the final layer ($\text{Block}_{40}$).
  • Figure 3: The average distance and speed-up between the optimal early exit layer and the exit layer chosen by each of the three gating functions in the RoBERTa model (more details in the appendix on applying gating functions directly to BERT-like models). To obtain the optimal threshold for each gating function, we constrain early-exit performance to be at least 98% of the original model's.
  • Figure 4: The average confidence score at each layer over all samples in the GLUE benchmark, based on the RoBERTa model.
  • Figure 5: The relationship between the average early exit layer and sequence length based on the Llama-2-Chat-13B model in WMT22 translation tasks.
  • ...and 6 more figures