Prediction hubs are context-informed frequent tokens in LLMs

Beatrix M. G. Nielsen, Iuri Macocco, Marco Baroni

TL;DR

This work analyzes hubness in autoregressive LLMs, distinguishing the probability distance used for next-token prediction from standard distance measures. It proves that, when token probabilities are non-uniform, this distance does not concentrate as representation dimensionality increases. Hubs still arise, but they reflect context-modulated frequent tokens rather than noise. Empirically, five diverse LLMs exhibit high hubness in probability distance without concentration of distances, whereas Euclidean-based comparisons reveal nuisance hubs and distance distributions that vary across models. The results imply that hubness is not inherently detrimental to next-token prediction, but caution is warranted when applying Euclidean or cosine-based similarity analyses to LLM representations outside the prediction task.

Abstract

Hubness, the tendency for a few points to be among the nearest neighbours of a disproportionate number of other points, commonly arises when applying standard distance measures to high-dimensional data, often negatively impacting distance-based analysis. As autoregressive large language models (LLMs) operate on high-dimensional representations, we ask whether they are also affected by hubness. We first prove that the only large-scale representation comparison operation performed by LLMs, namely that between context and unembedding vectors to determine continuation probabilities, is not characterized by the concentration of distances phenomenon that typically causes the appearance of nuisance hubness. We then empirically show that this comparison still leads to a high degree of hubness, but the hubs in this case do not constitute a disturbance. They are rather the result of context-modulated frequent tokens often appearing in the pool of likely candidates for next token prediction. However, when other distances are used to compare LLM representations, we do not have the same theoretical guarantees, and, indeed, we see nuisance hubs appear. There are two main takeaways. First, hubness, while omnipresent in high-dimensional spaces, is not a negative property that needs to be mitigated when LLMs are being used for next token prediction. Second, when comparing representations from LLMs using Euclidean or cosine distance, there is a high risk of nuisance hubs and practitioners should use mitigation techniques if relevant.
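The definition of hubness above can be made concrete with a minimal sketch. It assumes, as is common in the hubness literature, that hubness is summarised by the skewness of the $k$-occurrence distribution; the function names `k_occurrence` and `skewness`, the choice of Euclidean distance as the default, and the toy data are illustrative assumptions, not the authors' implementation.

```python
import math
import random
from collections import Counter


def k_occurrence(queries, candidates, k, dist=math.dist):
    """Count how often each candidate is among the k nearest neighbours of a query.

    `dist` defaults to Euclidean distance; any other distance function taking
    two points can be passed instead (an assumption of this sketch).
    """
    counts = Counter()
    for q in queries:
        # Rank candidate indices by their distance to the query point.
        ranked = sorted(range(len(candidates)), key=lambda j: dist(q, candidates[j]))
        counts.update(ranked[:k])
    return [counts[j] for j in range(len(candidates))]


def skewness(values):
    """Skewness (third standardised moment) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        return 0.0
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 50  # hypothetical dimensionality for the toy data
    queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
    candidates = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
    occ = k_occurrence(queries, candidates, k=10)
    print("max k-occurrence:", max(occ), "skewness:", skewness(occ))
```

A strongly positive skewness indicates that a few candidates collect most of the neighbour slots, which is the sense in which those points act as hubs.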

Paper Structure

This paper contains 22 sections, 2 theorems, 6 equations, 34 figures, 18 tables.

Key Result

Theorem 1

Let $\mathbf{x}_i \in X$ be a data point. Let $\mathbf{y}_j$, $j\in \{1, \ldots, v\}$, be the possible labels of points from $X$, and let $p(\mathbf{y}_j|\mathbf{x})$ be the probability of label $\mathbf{y}_j$ given $\mathbf{x}$ which uses representations $\mathbf{f}(\mathbf{x}), \mathbf{g}(\mathbf{y})$ ...

Figures (34)

  • Figure 1: Illustrative example of concentration of distances and $k$-occurrence. (Top) Distribution of 10,000 Euclidean distances between query and comparison points from a standard Gaussian in 3 and 300 dimensions. In 300 dimensions, no pair of points has a distance between 0 and 20, and most have a distance around 25, so the distances "concentrate". (Bottom) $k$-occurrence distributions for the data in (Top). For 3 dimensions, k-skew is close to 0, so the neighbour relation is symmetric. For 300 dimensions, k-skew is quite high (about 12), so the neighbour relation is very skewed, in accordance with the data exhibiting a concentration of distances.
  • Figure 2: Probability distance distribution for Pythia on contexts from Pile10k. If the distances were concentrated, we would not see this spread of distances all the way to zero (compare with Fig. 1).
  • Figure 3: $k$-occurrence distribution for Pythia predictions on contexts from Pile10k. This distribution is highly skewed with many hubs (points with $k$-occurrence larger than 100).
  • Figure 4: $k$-occurrence of hubs in Pythia predictions on contexts from Pile10k vs. frequency of vocabulary items in Pile10k. $\rho$ is the Spearman correlation.
  • Figure 5: $k$-occurrence of hubs in Pythia predictions (x-axis) vs. frequency of tokens (y-axis). $\rho$ is the Spearman correlation. Top row: Predictions made on contexts from Pile10k. Bottom row: Predictions made on contexts from Bookcorpus. First column: Frequency of tokens in Pile10k. Second column: Frequency of tokens in Bookcorpus. In both cases, correlation is higher when frequency is estimated on the same corpus as the contexts used for prediction.
  • ...and 29 more figures
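
The concentration of distances described in the caption of Figure 1 can be reproduced with a small simulation. The sketch below is an illustration under stated assumptions, not the authors' code: the function name `distance_spread`, the number of sampled pairs, and the seed are hypothetical choices. It samples pairs of points from a standard Gaussian and reports the mean distance, the relative spread (standard deviation divided by mean), and the smallest distance.

```python
import math
import random
import statistics


def distance_spread(dim, n_pairs=2000, seed=0):
    """Return (mean, relative spread, minimum) of Euclidean distances
    between independent standard Gaussian points in `dim` dimensions."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_pairs):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        y = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dists.append(math.dist(x, y))
    mean = statistics.fmean(dists)
    # Relative spread shrinks as distances concentrate around their mean.
    return mean, statistics.pstdev(dists) / mean, min(dists)


if __name__ == "__main__":
    for dim in (3, 300):
        mean, rel, smallest = distance_spread(dim)
        print(f"dim={dim}: mean={mean:.2f} rel_spread={rel:.3f} min={smallest:.2f}")
```

In 300 dimensions the mean distance sits near $\sqrt{2 \cdot 300} \approx 24.5$, consistent with the "around 25" in the caption, and the relative spread is much smaller than in 3 dimensions.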

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Theorem 1
  • proof