Understanding Token Probability Encoding in Output Embeddings

Hakaze Cho; Yoshihiro Sakai; Kenshiro Tanaka; Mariko Kato; Naoya Inoue

TL;DR

The paper reveals that next-token probabilities in language models are encoded in a common, sparse direction within the output embedding, yielding a log-linear relation $-\log \alpha_{w,\mathcal{D}, \theta} \approx A_{\mathcal{D}} \cdot E^{(o)}_w + B_{\mathcal{D}}$ under a concentrated-logit regime. It demonstrates causality by steering probabilities along this direction using the update $E_w^{(o)\prime} \leftarrow E_w^{(o)} - \log(r) \; \Omega \; A_{\mathcal{D}}^{-1}$, achieving accurate control up to $\sim 20\times$ and generalizing from few-shot data. The study shows the encoding is sparse, enabling removal of about 30–40% of output-embedding dimensions with minimal impact on distribution and generation, highlighting a potential saliency-based pruning paradigm. Additionally, corpus token frequency is encoded in the output embedding at very early training steps, suggesting the embedding aligns with the training data distribution from the outset. These findings offer a principled lens for understanding LM heads, with implications for model editing, pruning, and saliency scoring.
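To make the log-linear relation and the steering idea concrete, the following is a minimal sketch on synthetic data, not the authors' implementation. The array `E` stands in for the output embedding matrix, `neg_log_alpha` for the measured $-\log \alpha_{w,\mathcal{D},\theta}$, and the helper names `fit_log_linear` and `steer` are hypothetical. The steering step uses a minimum-norm shift along the fitted direction as a stand-in for the $\Omega \, A_{\mathcal{D}}^{-1}$ term, whose exact definition is not given in this summary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 500, 16

# Synthetic stand-ins (assumptions): random "output embeddings" and a sparse
# ground-truth direction in which only 3 dimensions carry probability signal.
E = rng.normal(size=(vocab, dim))
A_true = np.zeros(dim)
A_true[:3] = [1.5, -0.8, 0.5]
B_true = 6.0
neg_log_alpha = E @ A_true + B_true + 0.01 * rng.normal(size=vocab)


def fit_log_linear(E, neg_log_alpha):
    """Least-squares fit of -log(alpha_w) ~ A . E_w + B over all tokens w."""
    X = np.hstack([E, np.ones((E.shape[0], 1))])  # last column fits B
    coef, *_ = np.linalg.lstsq(X, neg_log_alpha, rcond=None)
    return coef[:-1], coef[-1]


def steer(E_w, A, r):
    """Shift E_w so that A . E_w drops by log(r), scaling the fitted
    probability by r. Minimum-norm shift: an assumption of this sketch."""
    return E_w - np.log(r) * A / (A @ A)


A_hat, B_hat = fit_log_linear(E, neg_log_alpha)

w, r = 42, 4.0  # hypothetical token index and target scale
before = np.exp(-(A_hat @ E[w] + B_hat))
after = np.exp(-(A_hat @ steer(E[w], A_hat, r) + B_hat))
ratio = after / before  # ratio of fitted probabilities after vs. before
```

In this toy setting the fitted probability scales by exactly `r` because the check is made against the linear fit itself. In a real LM the softmax renormalizes across the vocabulary, so the achieved scale has to be measured on the steered model, as the paper does.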

Abstract

In this paper, we investigate the output token probability information in the output embedding of language models. We find an approximate common log-linear encoding of output token probabilities within the output embedding vectors and empirically demonstrate that it is accurate and sparse. As a causality examination, we steer the encoding in output embedding to modify the output probability distribution accurately. Moreover, the sparsity we find in output probability encoding suggests that a large number of dimensions in the output embedding do not contribute to causal language modeling. Therefore, we attempt to delete the output-unrelated dimensions and find more than 30% of the dimensions can be deleted without significant movement in output distribution and sequence generation. Additionally, in the pre-training dynamics of language models, we find that the output embeddings capture the corpus token frequency information in early steps, even before an obvious convergence of parameters starts.
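The dimension-deletion idea can be illustrated with a small sketch, assuming that a dimension's saliency is the absolute slope it receives in a multiple linear regression of $-\log$ probability on the output embedding. This scoring rule, the synthetic data, and the helper name `keep_mask` are assumptions for illustration. The paper's actual procedure and evaluation on real LMs are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, dim = 400, 20

# Synthetic stand-ins (assumptions): only the first 4 dimensions encode
# the target, which plays the role of -log(alpha_w).
E = rng.normal(size=(vocab, dim))
A_true = np.zeros(dim)
A_true[:4] = [2.0, -1.0, 0.7, 0.4]
target = E @ A_true + 5.0 + 0.01 * rng.normal(size=vocab)

# Fit slopes A_hat and intercept B_hat by least squares.
X = np.hstack([E, np.ones((vocab, 1))])
coef, *_ = np.linalg.lstsq(X, target, rcond=None)
A_hat, B_hat = coef[:-1], coef[-1]


def keep_mask(A, drop_ratio):
    """Boolean mask that drops the `drop_ratio` fraction of dimensions
    with the smallest absolute slope (least salient first)."""
    n_drop = int(round(drop_ratio * A.size))
    order = np.argsort(np.abs(A))
    mask = np.ones(A.size, dtype=bool)
    mask[order[:n_drop]] = False
    return mask


mask = keep_mask(A_hat, 0.30)
E_pruned = E[:, mask]  # embedding matrix with 30% of dimensions removed

# Compare the fitted -log probability before and after pruning.
pred_full = E @ A_hat + B_hat
pred_pruned = E_pruned @ A_hat[mask] + B_hat
max_shift = np.max(np.abs(pred_full - pred_pruned))
```

This sketch only checks how much the fitted log-probability moves. On a real model, the matching dimensions of the final hidden state would also have to be dropped, and the effect would be measured on the output distribution and generated sequences.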

Paper Structure (38 sections, 8 equations, 17 figures, 3 tables)

This paper contains 38 sections, 8 equations, 17 figures, 3 tables.

Introduction
Token Probability Encoding in Output Embedding
Mathematical Log-linear Form
Empirical Confirmation
Token Probability Steering on Output Embeddings
Algorithm
Experiment Settings
Metrics.
Evaluations.
Results
Wide-scale stable: Large-scale probability steering is supported by a global log-linear pattern.
Few-shot generalizable: Encoding remains distinct even with an $A_\mathcal{D}$ estimated from a few-shot corpus.
Removing Dimensions with Weak Probability Encoding
Method & Experiment Settings
Experiment Settings.
...and 23 more sections

Figures (17)

Figure 1: The PCA result of the output embedding vectors of GPT2. Colors refer to the ranking percentile of the averaged output token probability.
Figure 2: Only a few directions/dimensions of the output embedding are strongly correlated with the output probabilities. (a-d): horizontal axis: the principal components of the output embedding, vertical axis: absolute Spearman $r$ between the principal component and the output probability distribution, color bar: the variance ratio loaded in the principal component; (e-h): horizontal axis: original dimensions, vertical axis: absolute MLR slopes between this dimension and the output probability distribution, color bar: the absolute Spearman correlations on the dimension.
Figure 3: The MLR results on GPT2 and GPT-J.
Figure 4: The $e_{ood}$ on detect datasets with various numbers of sentences (average tokens per sentence $\approx 134$).
Figure 5: The expected probability scales against the actually steered scales measured in the steered LMs.
...and 12 more figures
