Dynamic Vocabulary Pruning in Early-Exit LLMs

Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec

TL;DR

This work proposes dynamically pruning the vocabulary at test time for each token, and demonstrates that such post-hoc dynamic vocabulary pruning improves the efficiency of confidence estimation in early-exit LLMs while maintaining competitive performance.

Abstract

Increasing the size of large language models (LLMs) has been shown to lead to better performance. However, this comes at the cost of slower and more expensive inference. Early-exiting is a promising approach for improving the efficiency of LLM inference by enabling next token prediction at intermediate layers. Yet, the large vocabulary size in modern LLMs makes the confidence estimation required for exit decisions computationally expensive, diminishing the efficiency gains. To address this, we propose dynamically pruning the vocabulary at test time for each token. Specifically, the vocabulary is pruned at one of the initial layers, and the smaller vocabulary is then used throughout the rest of the forward pass. Our experiments demonstrate that such post-hoc dynamic vocabulary pruning improves the efficiency of confidence estimation in early-exit LLMs while maintaining competitive performance.
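The pruning step described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names (`prune_vocabulary`, `confidence`, `predict`), the toy sizes, the random weights, and the threshold of 0.9 are all assumptions, and the top softmax probability is used here as a stand-in confidence measure.

```python
import math
import random

def logits(W, h):
    """Project a hidden state h (length d) onto the rows of W (one row per token)."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def prune_vocabulary(W, h_p, k):
    """At the pruning layer, keep the k rows of W with the largest logits.

    Returns the kept token ids (most likely first) and the pruned matrix.
    """
    z = logits(W, h_p)
    keep = sorted(range(len(W)), key=lambda i: z[i], reverse=True)[:k]
    return keep, [W[i] for i in keep]

def confidence(W_t, h):
    """Assumed confidence measure: top softmax probability over the pruned vocabulary."""
    return max(softmax(logits(W_t, h)))

def predict(keep, W_t, h):
    """Map the best row of the pruned matrix back to its original token id."""
    z = logits(W_t, h)
    return keep[max(range(len(z)), key=z.__getitem__)]

# Toy setup (hypothetical sizes): vocabulary V, hidden size d, pruned size K.
random.seed(0)
V, d, K = 1000, 16, 32
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(V)]
h_p = [random.gauss(0, 1) for _ in range(d)]      # hidden state at the pruning layer
h_later = [random.gauss(0, 1) for _ in range(d)]  # hidden state at a later exit

keep, W_t = prune_vocabulary(W, h_p, K)  # one full V x d projection, done once
c = confidence(W_t, h_later)             # later exits only need a K x d projection
exit_now = c >= 0.9                      # hypothetical exit threshold
token_id = predict(keep, W_t, h_later)
```

The point of the design is that the full-vocabulary projection is paid once per token at the pruning layer, while every subsequent exit decision works with only `K` rows.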

Paper Structure

This paper contains 11 sections, 2 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Left: Illustration of our vocabulary pruning setup in Transformer models during inference. The model processes the input question with early exiting, and the vocabulary is pruned at a fixed layer ($p = 2$ in the figure). At each layer $\ell$, the model computes a confidence estimate $c_t^\ell$ and compares it against a threshold $\lambda_t^\ell$. When the model is sufficiently confident about the token to predict at layer $\ell+1$, that token is returned. Right: Visualization of our proposed pruning mechanism. At exit $p$, we first identify the top $K$ most likely tokens, which are used to subsample the rows of the unembedding matrix $\mathbf{W}$. The resulting pruned matrix $\mathbf{W}_t$ is then used for confidence estimation at all subsequent exits.
  • Figure 2: Rank (log-scale) of the final predicted token across model exits/layers on SQuAD rajpurkar2016squad and SamSum gliwa2019samsum using the early-exit version of the T5-large model bae2023fast. We observe a clear trend of very early layers showing a low average rank for the final predicted tokens, which motivates our dynamic vocabulary pruning approach.
  • Figure 3: Rank (log-scale) of the final predicted token across model exits/layers on SQuAD rajpurkar2016squad and SamSum gliwa2019samsum. Left: Results based on CALM schuster2022confident, the early-exit version of the T5-large model bae2023fast. These are the same results as those shown in Figure 2, included here for easier comparison. Right: Results based on the T5-large model 2020t5, where the non-adapted original unembedding matrix is used at intermediate layers to facilitate early-exiting.
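
The rank statistic plotted in Figures 2 and 3 can be illustrated with a small sketch. It assumes that "rank" means the 1-based position of the final predicted token when each exit's logits are sorted in descending order; the helper names and the toy logits below are hypothetical and are not data from the paper.

```python
def token_rank(layer_logits, token_id):
    """1-based rank of token_id at one exit; 1 means it is the top-scoring token."""
    target = layer_logits[token_id]
    return 1 + sum(1 for v in layer_logits if v > target)

def ranks_across_exits(per_layer_logits):
    """Rank of the final layer's predicted token at every exit."""
    last = per_layer_logits[-1]
    final_token = max(range(len(last)), key=last.__getitem__)
    return [token_rank(z, final_token) for z in per_layer_logits]

# Toy logits over a 4-token vocabulary at three exits (illustrative values only).
per_layer_logits = [
    [0.1, 0.5, 0.3, 0.2],  # exit 1: the final token (id 2) is ranked second
    [0.0, 0.4, 0.6, 0.1],  # exit 2: it is already the top token
    [0.0, 0.2, 0.9, 0.1],  # final layer: defines the predicted token
]
ranks = ranks_across_exits(per_layer_logits)  # [2, 1, 1]

# If the rank at the pruning exit p is at most K, top-K pruning keeps the final token.
p, K = 1, 2
survives_pruning = ranks[p - 1] <= K  # True
```

A low rank at an early exit is what makes pruning safe there: the token the full model would eventually predict is already among the top `K` candidates.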