Dynamic layer selection in decoder-only transformers

Theodore Glavas, Joud Chataoui, Florence Regol, Wassim Jabbour, Antonios Valkanas, Boris N. Oreshkin, Mark Coates

TL;DR

A pre-trained decoder-only model is significantly more robust to layer removal via layer skipping than via early exit. By constructing an oracle controller, we show that allocating computation dynamically on a per-sequence basis holds promise for significant efficiency gains.

Abstract

The vast size of Large Language Models (LLMs) has prompted a search for ways to make inference more efficient. One effective approach is dynamic inference, which adapts the architecture to the sample at hand to reduce the overall computational cost. We empirically examine two common dynamic inference methods for natural language generation (NLG): layer skipping and early exiting. We find that a pre-trained decoder-only model is significantly more robust to layer removal via layer skipping than via early exit. We demonstrate the difficulty of using hidden state information to adapt computation on a per-token basis for layer skipping. Finally, by constructing an oracle controller, we show that dynamic computation allocation on a per-sequence basis holds promise for significant efficiency gains. Remarkably, we find that there exists an allocation that matches the performance of the full model using only 23.3% of its layers on average.
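To make the two strategies concrete, the following minimal sketch builds the list of executed layers for a given budget. Early exit runs the first layers and stops, while layer skipping spreads the same number of layers across the full depth. The function names, the evenly spaced rule, and the toy `run_route` helper are illustrative assumptions, not the paper's implementation; in particular, the evenly spaced rule is not claimed to match the paper's uniform layer skipping equation. Always keeping layer 1 in the random route follows the Figure 4 caption.

```python
import random


def early_exit_route(num_layers: int, budget: int) -> list[int]:
    """Early exit: execute layers 1..budget, then stop."""
    if not 1 <= budget <= num_layers:
        raise ValueError("budget must be between 1 and num_layers")
    return list(range(1, budget + 1))


def evenly_spaced_route(num_layers: int, budget: int) -> list[int]:
    """Layer skipping with kept layers spread evenly over the depth.

    Illustrative spacing rule (an assumption, not the paper's equation).
    """
    if not 1 <= budget <= num_layers:
        raise ValueError("budget must be between 1 and num_layers")
    if budget == 1:
        return [1]
    # Integer arithmetic keeps the indices distinct and ends at the last layer.
    return [1 + (i * (num_layers - 1)) // (budget - 1) for i in range(budget)]


def random_skip_route(num_layers: int, budget: int, rng: random.Random) -> list[int]:
    """Random layer skipping that always executes layer 1."""
    if not 1 <= budget <= num_layers:
        raise ValueError("budget must be between 1 and num_layers")
    rest = rng.sample(range(2, num_layers + 1), budget - 1)
    return sorted([1] + rest)


def run_route(hidden, layers, route):
    """Apply only the layers listed in `route`; skipped layers act as the identity."""
    for idx in route:
        hidden = layers[idx - 1](hidden)
    return hidden


print(early_exit_route(24, 6))     # [1, 2, 3, 4, 5, 6]
print(evenly_spaced_route(24, 6))  # [1, 5, 10, 14, 19, 24]
```

All three routes use the same number of layers, so they have the same nominal cost; they differ only in which parts of the network's depth are retained.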

Paper Structure

This paper contains 39 sections, 5 equations, 10 figures.

Figures (10)

  • Figure 1: Using OPT-1.3B (24 layers) on the Alpaca dataset. We report the mean and 95% confidence interval on the test set. (a) Average cosine similarity of the final hidden states $\mathbf{h}^L_{t}$ (top) and layerwise intermediate hidden states $\mathbf{h}^l_{t}, 1\leq l \leq 23$ (bottom) obtained from early exit (EE), random and uniform layer skipping (RLS/ULS) compared to the full model execution. (b) Comparison of cost-performance curves when training skip controllers to use the hidden states $\mathbf{h}^{l-1}$ or a fixed input. We see that, with statistical significance, using the hidden state does not help the model's performance.
  • Figure 2: ROUGE-L (right axis) of $T$ and $T_\text{ULS, x}$ compared to an oracle skip controller that selects the optimal model per sequence given a global budget. For each budget, the selection percentage of each model by the oracle is shown as a stacked plot (left axis). The yellow star denotes where the oracle can match the performance of the full model, using an average of 5.6/24 layers ($23.3\%$).
  • Figure 3: Uniform layer skipping strategy of a 24-layer model following Equation \ref{eq:ULS} for different computational costs $c$.
  • Figure 4: Using OPT-1.3B (24 layers) on the Alpaca dataset. We report the mean and 95% confidence interval on the test set. Average cosine similarity of the final hidden states $\mathbf{h}^L_{t}$ (top) and layerwise intermediate hidden states $\mathbf{h}^l_{t}, 1\leq l \leq 23$ (bottom) obtained from each dynamic route strategy compared to the full model execution. We include RLS wo/1, random layer skipping without enforcing the execution of layer 1, and see that it performs consistently worse than RLS.
  • Figure 5: Using OPT-1.3B (24 layers) on the CNN-DM dataset. We report the mean and 95% confidence interval on the test set. The cosine similarity of the final hidden states $\mathbf{h}^L_{t}$ (top) and layerwise intermediate hidden states $\mathbf{h}^l_{t}, 1\leq l \leq 23$ (bottom) obtained from early exit (EE), random and uniform layer skipping (RLS/ULS) is shown compared to the full model execution.
  • ...and 5 more figures
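
The Figure 2 caption describes an oracle that picks the best model for each sequence under a global budget. The text here does not specify how that oracle is computed, so the sketch below shows one possible formulation under stated assumptions: each candidate model has a known integer layer cost, a per-sequence quality score (for example ROUGE-L) is available for every candidate, and the budget is an average number of layers per sequence. Under these assumptions the selection is a multiple-choice knapsack problem, solved here exactly with dynamic programming. The name `oracle_allocation` and its arguments are hypothetical.

```python
import math


def oracle_allocation(scores, costs, avg_budget):
    """Pick one candidate model per sequence to maximize the mean score.

    scores[i][m]: quality of candidate m on sequence i (assumed known to the oracle).
    costs[m]:     integer number of layers executed by candidate m.
    avg_budget:   maximum average number of layers per sequence.
    Returns (allocation, mean_score), where allocation[i] is a candidate index.
    """
    n = len(scores)
    cap = math.floor(avg_budget * n + 1e-9)  # total layers allowed over all sequences
    neg_inf = float("-inf")

    # best[b] = best total score over the sequences seen so far using exactly b layers.
    best = [neg_inf] * (cap + 1)
    best[0] = 0.0
    picks = []  # picks[i][b] = (candidate, previous b) used to reach total b

    for i in range(n):
        new_best = [neg_inf] * (cap + 1)
        pick = [None] * (cap + 1)
        for b in range(cap + 1):
            if best[b] == neg_inf:
                continue
            for m, cost in enumerate(costs):
                nb = b + cost
                if nb <= cap and best[b] + scores[i][m] > new_best[nb]:
                    new_best[nb] = best[b] + scores[i][m]
                    pick[nb] = (m, b)
        best = new_best
        picks.append(pick)

    end = max(range(cap + 1), key=lambda b: best[b])
    if best[end] == neg_inf:
        raise ValueError("budget is too small for even the cheapest candidate")

    # Backtrack from the best final total to recover the per-sequence choices.
    allocation = [0] * n
    b = end
    for i in reversed(range(n)):
        m, b = picks[i][b]
        allocation[i] = m
    return allocation, best[end] / n
```

Sweeping `avg_budget` over a range of values would trace a cost-performance curve of the kind the caption describes. The point marked there, 5.6 of 24 layers on average, corresponds to $5.6/24 \approx 23.3\%$.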