Looking Beyond The Top-1: Transformers Determine Top Tokens In Order

Daria Lioubashevski, Tomer Schlank, Gabriel Stanovsky, Ariel Goldstein

TL;DR

This work analyzes the computation performed by Transformers in the layers after the top-1 prediction has become fixed, and proposes an underlying mechanism of task transition for this sequential saturation, where task k corresponds to predicting the k-th most probable token, and the saturation events are in fact discrete transitions between the tasks.

Abstract

Understanding the inner workings of Transformers is crucial for achieving more accurate and efficient predictions. In this work, we analyze the computation performed by Transformers in the layers after the top-1 prediction has become fixed, which has been previously referred to as the "saturation event". We extend the concept of saturation events to the top-k tokens, demonstrating that similar saturation events occur across language, vision, and speech models. We find that these saturation events happen in order of the corresponding tokens' ranking, i.e., the model first decides on the top-ranking token, then the second-highest-ranking token, and so on. This phenomenon seems intrinsic to the Transformer architecture, occurring across different architectural variants (decoder-only, encoder-only, and to a lesser extent full-Transformer), and even in untrained Transformers. We propose an underlying mechanism of task transition for this sequential saturation, where task k corresponds to predicting the k-th most probable token, and the saturation events are in fact discrete transitions between the tasks. In support of this, we show that it is possible to predict the current task from the hidden-layer embeddings. Furthermore, using an intervention method, we demonstrate that we can cause the model to switch from one task to the next. Finally, leveraging our findings, we introduce a novel token-level early-exit strategy, which surpasses existing methods in balancing performance and efficiency.

Paper Structure

This paper contains 26 sections, 3 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: An illustration of the proposed task-transition mechanism, wherein the layers of the Transformer perform a sequence of tasks in order: task $i$ is determining the $i$-th token in the final ranking, and the transition between task $i$ and task $i+1$ occurs at the corresponding $i$-th saturation layer. The transition is akin to a switch being flipped "on" and staying "on" for the remaining layers, representing the $i$-th token being fixed from this point onward.
  • Figure 2: Schematic of our framework and visualization of the ordered saturation of the top-k tokens on GPT2-XL. The hidden states from each layer are projected onto the vocabulary space using the unembedding matrix $E$, then sorted in descending order and treated as rankings. The saturation effect is marked separately for each token in the top-4 of the final ranking, emphasizing the fact that the 2nd token saturates after the 1st token, the 3rd token saturates after the 2nd token, and so on. The dashed line represents the previously established saturation event of the top-1 token.
  • Figure 3: Average rank of the $k$-th saturation layer among the saturation layers for $k = 1, \dots, 5$, with standard error bars. Asterisks indicate statistically significant differences between consecutive token ranks (*** represents $p < 0.001$), based on an independent samples t-test.
  • Figure 4: Left: Forward pass of two input tokens ("wanted" and "the") in the same context for which the model's final top-1 prediction is the same ("artist"), but the 1st saturation layers are different (25 and 40 respectively). Right: By injecting the output from the top-1 saturation layer of "the" as input to the subsequent layer of "artist", we trigger a saturation at the injected layer (26) in the post-intervention run, without altering the top-1 prediction. Saturation layers are marked in bold. The use of activations from adjacent layers is not depicted for the sake of clarity.
  • Figure 5: Flipping the Top-1 Switch. The percentage of examples where the top-1 saturation occurred at the injected layer after the intervention, shown as a function of the layer from which the injected activations were taken, relative to the original saturation layer (e.g., $-2$ means activations were taken from two layers before the original saturation layer).
  • ...and 1 more figure

Theorems & Definitions (2)

  • Definition 2.1: 1st Saturation Layer; geva2022transformer
  • Definition 2.2: $k$-th Saturation Layer
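The full statements of Definitions 2.1 and 2.2 are not reproduced in this listing, so the following is only a minimal sketch of how a $k$-th saturation layer might be computed, based on the description in the abstract and the Figure 2 caption. It assumes that the $k$-th saturation layer is the earliest layer from which the token at rank $k$ (after projecting the hidden state with the unembedding matrix $E$) stays equal to the token at rank $k$ in the final layer. The function name `kth_saturation_layer`, the 0-indexed layer convention, and the toy data are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import numpy as np


def kth_saturation_layer(hidden_states, unembed, k):
    """Return the 0-indexed layer from which the rank-k token stays fixed.

    hidden_states: array of shape (num_layers, d_model), one row per layer.
    unembed:       array of shape (d_model, vocab_size), the matrix E.
    k:             1-indexed rank of interest (k=1 is the top-1 token).

    Assumption (not the paper's verbatim definition): saturation is the
    earliest layer l such that, for every layer l' >= l, the rank-k token
    equals the rank-k token of the final layer.
    """
    # Project every layer's hidden state onto the vocabulary space.
    logits = hidden_states @ unembed
    # Sort token ids by descending logit to obtain a ranking per layer.
    rankings = np.argsort(-logits, axis=1, kind="stable")
    kth_tokens = rankings[:, k - 1]
    final_token = kth_tokens[-1]

    # Walk backwards from the last layer while the rank-k token is unchanged.
    layer = len(kth_tokens) - 1
    while layer > 0 and kth_tokens[layer - 1] == final_token:
        layer -= 1
    return layer


# Toy example: 5 layers, vocabulary of 4 tokens. The unembedding matrix is
# the identity, so each hidden state is read directly as logits.
toy_hidden = np.array([
    [0.1, 0.9, 0.5, 0.2],
    [0.9, 0.1, 0.5, 0.2],
    [0.9, 0.5, 0.1, 0.2],
    [0.9, 0.5, 0.2, 0.1],
    [0.9, 0.5, 0.2, 0.1],
])
toy_unembed = np.eye(4)

saturation_layers = [
    kth_saturation_layer(toy_hidden, toy_unembed, k) for k in (1, 2, 3)
]
print(saturation_layers)  # [1, 2, 3]
```

In this constructed example the rank-1, rank-2, and rank-3 tokens become fixed at layers 1, 2, and 3 respectively, which mirrors the ordered saturation pattern described above. With a real model, `hidden_states` would hold the per-layer hidden states for a single token position.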