I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?

Yuhang Liu, Dong Gong, Yichao Cai, Erdun Gao, Zhen Zhang, Biwei Huang, Mingming Gong, Anton van den Hengel, Javen Qinfeng Shi

TL;DR

The paper asks whether next-token prediction in LLMs can learn human-interpretable latent concepts. It proposes a discrete latent-variable model in which the mapping from latent concepts to observed tokens may be non-invertible, and proves an identifiability result under a diversity condition: LLM representations approximate the log posterior of the latent concepts up to an invertible linear transformation, $\mathbf{f}_\mathbf{x}(\mathbf{x}) \approx \mathbf{A} [\log p(\mathbf{c}|\mathbf{x})] + \mathbf{b} + o(\epsilon)$. This result unifies several forms of the linear representation hypothesis (concepts as directions, manipulability, and linear probing) through the matrix $\mathbf{A}$. It also motivates evaluating sparse autoencoders (SAEs) by how well their features align with $\log p(c^i|\mathbf{x})$, and a structured SAE variant. Empirically, simulations and evaluations on Pythia, Llama, and DeepSeek show that $\mathbf{A}_s \mathbf{W}_s$ approximates the identity and that structured SAEs correlate more strongly with concept posteriors than baselines, supporting the view that next-token prediction captures underlying generative factors rather than merely memorizing data.

Abstract

The remarkable achievements of large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. This is as opposed to explanations of their capabilities based on their ability to perform relatively simple manipulations of vast volumes of data. To illuminate the distinction between these explanations, we introduce a novel generative model that generates tokens on the basis of human-interpretable concepts represented as latent discrete variables. Under mild conditions, even when the mapping from the latent space to the observed space is non-invertible, we establish an identifiability result, i.e., the representations learned by LLMs through next-token prediction can be approximately modeled as the logarithm of the posterior probabilities of these latent discrete concepts given input context, up to an invertible linear transformation. This theoretical finding not only provides evidence that LLMs capture underlying generative factors, but also provides a unified perspective for understanding the linear representation hypothesis. Taking this a step further, our finding motivates a reliable evaluation of sparse autoencoders by treating the performance of supervised concept extractors as an upper bound. Pushing this idea even further, it inspires a structural variant that enforces dependence among latent concepts in addition to promoting sparsity. Empirically, we validate our theoretical results through evaluations on both simulation data and the Pythia, Llama, and DeepSeek model families, and demonstrate the effectiveness of our structured sparse autoencoder.

Paper Structure

This paper contains 61 sections, 5 theorems, 79 equations, 8 figures, 2 tables.

Key Result

Theorem 3.1

Under the diversity condition above, the true latent variables $\mathbf{c}$ are related to the representations in LLMs, i.e., $\mathbf{f}_\mathbf{x}(\mathbf{x})$, which are learned through the next-token prediction framework, by the following relationship: $\mathbf{f}_\mathbf{x}(\mathbf{x}) = \mathbf{A} [\log{p(\mathbf{c}=\mathbf{c}_i|\mathbf{x})}]_{\mathbf{c}_i} + \mathbf{b}+o({\epsilon})$, where ${\mathbf{h}_y} ={[h_{y_{1}}-h_{y_{0}},...,h_{y_{\ell}}-h_{y_{0}}]}$ with $h_{y_k}=[{p(\mathbf{c}=\mathbf{c}_i|y=y_k)}]^{T}_{\mathbf{c}_i}$ ...
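
The claim that representations match the log posterior "up to an invertible linear transformation" can be illustrated with a small synthetic sketch. Everything here is an assumption made for illustration and is not taken from the paper: the sizes `K` and `N`, the Dirichlet-sampled prior and likelihood, and the matrix `A` and offset `b`. The sketch builds features exactly of the form $\mathbf{A}\log p(\mathbf{c}|\mathbf{x}) + \mathbf{b}$ (no $o(\epsilon)$ term) and checks that an ordinary least-squares fit recovers the map, so the log posterior can be read back from the features.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 4, 50  # hypothetical sizes: K latent concept states, N distinct contexts x

# Toy discrete generative model (illustrative, not the paper's model).
prior = rng.dirichlet(np.ones(K))                  # p(c_k)
lik = rng.dirichlet(np.ones(N), size=K)            # lik[k, n] = p(x_n | c_k)
joint = prior[:, None] * lik                       # p(c_k, x_n)
post = joint / joint.sum(axis=0, keepdims=True)    # Bayes' rule: p(c_k | x_n)
log_post = np.log(post).T                          # shape (N, K)

# Synthetic "representations": an invertible linear map of the log posterior.
A = np.eye(K) + 0.1 * rng.normal(size=(K, K))      # assumed well-conditioned map
b = rng.normal(size=K)
F = log_post @ A.T + b                             # row n is A log p(c | x_n) + b

# Recover A and b by least squares on [log_post, 1].
X = np.hstack([log_post, np.ones((N, 1))])
coef, *_ = np.linalg.lstsq(X, F, rcond=None)
A_hat, b_hat = coef[:K].T, coef[K]

# Because A_hat is invertible, the log posterior is recoverable from F.
recovered = (F - b_hat) @ np.linalg.inv(A_hat).T
```

In a real model the features would only approximately follow this form, so the fit would be approximate rather than exact.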

Figures (8)

  • Figure 1: An overview of the main contributions of this work. On the left, we illustrate the proposed latent variable model that represents concepts as latent variables $\mathbf{c}$, which are used to generate both the input $\mathbf{x}$ and output $y$ within a next-token prediction framework. Leveraging Bayes' rule, the next-token prediction framework, and the diversity condition, we establish an identifiability result: the representations learned by LLMs approximately correspond to a linear transformation of the logarithm of the posterior distribution of latent variables conditioned on input tokens, i.e., $\mathbf{f}_\mathbf{x}(\mathbf{x}) = \mathbf{A} [\log{p(\mathbf{c}=\mathbf{c}_i|\mathbf{x})}]_{\mathbf{c}_i} + \mathbf{b}+o({\epsilon})$, where $\mathbf{b}$ is a constant, and $o(\epsilon)$ represents a term that becomes asymptotically smaller than $\epsilon$ as $\epsilon \to 0$. This identifiability result provides support for the linear representation hypothesis and for sparse autoencoders.
  • Figure 2: (Left) Classification accuracy under varying numbers of observed variables. (Right) Classification accuracy across different graph structures.
  • Figure 3: Results of the product $\mathbf{A}_s \times \mathbf{W}_s$ across the LLaMA-2 and Pythia model families. Here, $\mathbf{A}_s$ represents a matrix derived from the feature differences of 27 counterfactual pairs, while $\mathbf{W}_s$ is a weight matrix obtained from a linear classifier trained on these features. The product approximates the identity matrix, supporting the theoretical findings outlined in Corollary \ref{property2}.
  • Figure 4: Comparison of SAE models: correlation scores and reconstruction loss on the validation dataset.
  • Figure 5: Classification accuracy of logistic probes across various concepts. Each bar represents the performance for a given concept.
  • ...and 3 more figures
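
The check described in the Figure 3 caption, that $\mathbf{A}_s \times \mathbf{W}_s$ is close to the identity, can be sketched on synthetic data. This is a toy illustration under stated assumptions, not the paper's experiment: the features come from a hypothetical linear model of binary concepts rather than from an LLM, the sizes `m`, `d`, `n` and the noise level are arbitrary, and a least-squares probe stands in for the trained linear classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n = 5, 16, 2000  # hypothetical sizes: concepts, feature dim, samples

A = rng.normal(size=(d, m))                        # stand-in mixing matrix
b = rng.normal(size=d)
C = rng.integers(0, 2, size=(n, m)).astype(float)  # binary concept labels


def features(c):
    """Synthetic features: linear in the concepts plus small Gaussian noise."""
    return c @ A.T + b + 0.01 * rng.normal(size=(c.shape[0], d))


F = features(C)

# A_s: one row per concept, the mean feature difference over counterfactual
# pairs that differ only in that concept (on vs. off).
A_s = np.zeros((m, d))
for i in range(m):
    on, off = C.copy(), C.copy()
    on[:, i], off[:, i] = 1.0, 0.0
    A_s[i] = (features(on) - features(off)).mean(axis=0)

# W_s: weights of a linear probe from features to concepts, fitted here by
# least squares (an assumption; the caption describes a linear classifier).
X = np.hstack([F, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(X, C, rcond=None)
W_s = coef[:d]                                     # shape (d, m)

product = A_s @ W_s                                # shape (m, m)
```

Row `i` of `product` is the probe's predicted change in each concept when only concept `i` is flipped, which is why a faithful linear encoding yields a near-identity matrix.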

Theorems & Definitions (11)

  • Definition 2.1
  • Theorem 3.1
  • Definition 4.1
  • Corollary 4.2: Concepts Are Encoded in the Matrix $\mathbf{A}$
  • Corollary 4.3: Linear Classifiability of Representations
  • proof
  • proof
  • Corollary F.1: Binary Concept Direction
  • proof
  • Corollary F.2: Binary Concept Classification
  • ...and 1 more