
What Should Embeddings Embed? Autoregressive Models Represent Latent Generating Distributions

Liyi Zhang, Michael Y. Li, R. Thomas McCoy, Theodore R. Sumers, Jian-Qiao Zhu, Thomas L. Griffiths

TL;DR

This work proposes a Bayes-optimal framework for what the embeddings learned by autoregressive language models should represent, tying embeddings to predictive sufficient statistics. It formalizes three canonical data-generating settings (exchangeable data, latent state models, and discrete hypotheses) and shows, both analytically and via probing experiments, that transformers encode the corresponding latent distributions (e.g., sufficient statistics, posteriors over states, and topic mixtures). Through synthetic and natural-corpus experiments (including Gaussian-Gamma, Beta-Bernoulli, HMM-LDA, and LDA-based topic models on 20NG and WikiText-103), the authors demonstrate that simple probes recover these quantities from the embeddings, and that the embeddings generalize out-of-distribution without memorizing tokens. The findings offer a principled lens for interpretability and suggest design directions for LLM training and evaluation, notably in representing uncertainty and latent generative factors. Overall, the paper bridges Bayesian inference with deep autoregressive models and highlights predictive sufficiency as a guiding principle for embedding content and downstream interpretability.
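
To make the notion of a predictive sufficient statistic concrete for the Beta-Bernoulli case mentioned above, here is a minimal sketch. It is not the authors' code. The prior parameters `alpha` and `beta`, the grid resolution, and both function names are assumptions chosen for illustration. The sketch compares the next-token prediction computed from the full sequence (by numerical integration over $\theta$) with the closed form that uses only the count of ones and the sequence length.

```python
def predictive_from_statistic(k, n, alpha=2.0, beta=2.0):
    """P(x_{n+1} = 1) using only the sufficient statistic (k ones out of n)."""
    return (alpha + k) / (alpha + beta + n)


def predictive_from_sequence(xs, alpha=2.0, beta=2.0, grid=20000):
    """P(x_{n+1} = 1 | x_{1:n}) by midpoint-rule integration over theta.

    The sequence is processed token by token, so nothing here assumes
    in advance that only the counts matter.
    """
    num = 0.0
    den = 0.0
    for i in range(grid):
        theta = (i + 0.5) / grid
        # Unnormalized Beta(alpha, beta) prior density (illustrative choice).
        weight = theta ** (alpha - 1) * (1 - theta) ** (beta - 1)
        # Bernoulli likelihood of the observed sequence.
        for x in xs:
            weight *= theta if x == 1 else 1 - theta
        num += theta * weight
        den += weight
    return num / den


# Two different orderings with the same counts (3 ones out of 5).
seq_a = [1, 0, 1, 1, 0]
seq_b = [0, 1, 1, 0, 1]

closed = predictive_from_statistic(sum(seq_a), len(seq_a))
brute_a = predictive_from_sequence(seq_a)
brute_b = predictive_from_sequence(seq_b)
print(closed, brute_a, brute_b)
```

The three printed values agree up to integration error. Under these assumptions the pair (count of ones, length) carries everything the sequence says about the next token, which is the sense in which an embedding only needs to encode the sufficient statistic.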

Abstract

Autoregressive language models have demonstrated a remarkable ability to extract latent structure from text. The embeddings from large language models have been shown to capture aspects of the syntax and semantics of language. But what should embeddings represent? We connect the autoregressive prediction objective to the idea of constructing predictive sufficient statistics to summarize the information contained in a sequence of observations, and use this connection to identify three settings where the optimal content of embeddings can be identified: independent identically distributed data, where the embedding should capture the sufficient statistics of the data; latent state models, where the embedding should encode the posterior distribution over states given the data; and discrete hypothesis spaces, where the embedding should reflect the posterior distribution over hypotheses given the data. We then conduct empirical probing studies to show that transformers encode these three kinds of latent generating distributions, and that they perform well in out-of-distribution cases and without token memorization in these settings.
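
For the latent state setting described in the abstract, the following sketch shows why the posterior over the next state is enough for prediction. It uses a plain discrete hidden Markov model, which is a simplification and not the HMM-LDA model from the paper. All parameter values and function names are hypothetical. The next-token distribution computed from the filtered belief $p(z_{n+1}|x_{1:n})$ is checked against a brute-force computation that sums over every state path.

```python
from itertools import product


def filter_next_state(xs, init, trans, emit):
    """Return p(z_{n+1} | x_{1:n}) via the forward (filtering) recursion."""
    belief = list(init)  # p(z_1)
    states = range(len(init))
    for x in xs:
        # Condition the current belief on the observed token.
        post = [belief[z] * emit[z][x] for z in states]
        total = sum(post)
        post = [p / total for p in post]
        # Propagate one step through the transition matrix.
        belief = [sum(post[z] * trans[z][z2] for z in states) for z2 in states]
    return belief


def predict_next_token(belief, emit):
    """p(x_{n+1} | x_{1:n}) computed from the belief over z_{n+1} alone."""
    return [sum(belief[z] * emit[z][x] for z in range(len(belief)))
            for x in range(len(emit[0]))]


def sequence_prob(xs, init, trans, emit):
    """p(x_{1:n}) by explicit enumeration of all state paths."""
    total = 0.0
    for path in product(range(len(init)), repeat=len(xs)):
        p = init[path[0]] * emit[path[0]][xs[0]]
        for t in range(1, len(xs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][xs[t]]
        total += p
    return total


# Hypothetical toy model: 2 latent states, 3 token types.
init = [0.6, 0.4]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
xs = [0, 2, 2, 1]

belief = filter_next_state(xs, init, trans, emit)
from_belief = predict_next_token(belief, emit)
brute_force = [sequence_prob(xs + [v], init, trans, emit)
               / sequence_prob(xs, init, trans, emit) for v in range(3)]
print(from_belief)
print(brute_force)
```

The two printed distributions match up to floating-point error. In this toy model the belief vector is a fixed-size summary of the whole history, which is the quantity the abstract says an optimal embedding should encode.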

Paper Structure (72 sections, 20 equations, 11 figures, 14 tables)


Figures (11)

  • Figure 1: Three data generation processes where the prediction of the next token $x_{n+1}$ is conditionally independent of the previous tokens $x_{1:n}$ given a predictive sufficient statistic. The left corresponds to exchangeable data, the middle to latent state models, and the right to discrete hypotheses. The relevant predictive sufficient statistics are the sufficient statistic for $\theta$, $z_{n+1}$, and $h$ (or $p(\theta|x_{1:n})$, $p(z_{n+1}|x_{1:n})$, and $p(h|x_{1:n})$, respectively). We show that the embeddings learned by autoregressive transformers represent this information.
  • Figure 2: Probe recovery of transformer-learned sufficient statistic (blue) and ground truth sufficient statistic (red) on the y-axis, across 1000 test datapoints on the x-axis. In the plot above, the datapoints are sorted based on their ground truth sufficient statistic. The first row shows parameters probed in the non-OOD case (from left to right: Gaussian mean $\mu$, Gaussian precision $\tau$, Bernoulli mean, and Exponential mean). The second row shows the corresponding information in the OOD case.
  • Figure 3: (a): Two discrete hypothesis spaces $\mathcal{H}$ used in experiments. Any continuous rectangle contained within the axes (e.g., the red or the orange rectangle) is a valid hypothesis $h \in \mathcal{H}$. The data consist of a sequence of points sampled uniformly from the target rectangle. (b) and (c): Two-dimensional representation of embeddings of all validation datapoints (the setup is unequal width and sample size $=50$). The two subfigures show the same embeddings, colored by properties of the true generating rectangles.
  • Figure 4: Probing for the first 10 tokens themselves using the transformer's 10th token embedding. Aside from perfectly encoding the 10th token, this embedding shows no memorization of the other 9 tokens, as indicated by the noise in probe recovery.
  • Figure 5: Left and middle panels: control experiments showing probe validation performance of AT (left) and BERT (middle) on synthetic data. For each of AT and BERT, five models are trained and validated on five datasets, with each dataset generated by a distinct topic model. Colors show probe accuracy. A cell in row $i$ and column $j$ corresponds to model $i$ on dataset $j$, so the diagonal corresponds to a model on its own dataset. For AT, performance is strong only on the dataset with the same generating topic model, suggesting that the underlying statistical model, not the probe taking different word embeddings, is responsible for performance; this relationship is also present for BERT, but to a lesser degree. Remaining panel: 20NG probe classification performance (accuracy) vs. negative perplexity measured at 100 different tokens. The dots are colored by position percentile. Probe performance increases with lower perplexity.
  • ...and 6 more figures
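
The discrete hypothesis setting in the Figure 3 caption (points sampled uniformly from a target rectangle) can be sketched as a posterior computation over candidate rectangles. This is an illustrative toy, not the paper's experimental setup. The integer grid, the uniform prior over hypotheses, and the function names are all assumptions.

```python
from itertools import combinations


def rectangle_hypotheses(grid_size):
    """All axis-aligned rectangles (x1, x2, y1, y2) with integer corners."""
    edges = list(combinations(range(grid_size + 1), 2))  # pairs with lo < hi
    return [(x1, x2, y1, y2) for x1, x2 in edges for y1, y2 in edges]


def posterior(points, hypotheses):
    """p(h | x_{1:n}) under a uniform prior (an assumption of this sketch).

    Points are uniform inside the true rectangle, so the likelihood of n
    points under h is area(h) ** -n if h contains them all, and 0 otherwise.
    """
    weights = []
    for x1, x2, y1, y2 in hypotheses:
        inside = all(x1 <= px <= x2 and y1 <= py <= y2 for px, py in points)
        area = (x2 - x1) * (y2 - y1)
        weights.append(area ** -len(points) if inside else 0.0)
    total = sum(weights)
    return [w / total for w in weights]


hypotheses = rectangle_hypotheses(4)
points = [(1.2, 1.5), (1.8, 1.1)]
post = posterior(points, hypotheses)

best = hypotheses[max(range(len(post)), key=post.__getitem__)]
print(best)  # smallest grid rectangle containing both points
```

Under these assumptions the posterior concentrates on the smallest consistent rectangles as more points arrive, and it depends on the data only through the bounding box of the points and their number.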

Theorems & Definitions (1)

  • proof