I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?
Yuhang Liu, Dong Gong, Yichao Cai, Erdun Gao, Zhen Zhang, Biwei Huang, Mingming Gong, Anton van den Hengel, Javen Qinfeng Shi
TL;DR
The paper asks whether next-token prediction in LLMs can learn human-interpretable latent concepts. It proposes a discrete latent-variable generative model whose mapping from the latent space to the observed space may be non-invertible, and proves an identifiability result under a diversity condition: LLM representations approximate the log posterior of the latent concepts given the context, up to an invertible linear transformation, $\mathbf{f}_x(\mathbf{x}) \approx \mathbf{A} [\log p(\mathbf{c}|\mathbf{x})] + \mathbf{b} + o(\epsilon)$. This result gives a unified view of the linear representation hypothesis, linking concepts-as-directions, manipulability, and linear probing through the matrix $\mathbf{A}$. It also motivates evaluating sparse autoencoders (SAEs) by how well their features align with $\log p(c^i|\mathbf{x})$, and leads to a structured SAE variant. Empirically, simulations and evaluations on Pythia, Llama, and DeepSeek show that $\mathbf{A}_s \mathbf{W}_s$ approximates the identity, and that structured SAEs yield higher correlations with concept posteriors than baselines, supporting the view that next-token prediction captures underlying generative factors rather than merely memorizing data.
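To make the linear relation above concrete, the following is a minimal NumPy sketch on synthetic data, not the paper's experimental setup. It assumes independent binary concepts with made-up posteriors, a randomly drawn invertible matrix `A`, an offset `b`, and a small Gaussian noise term standing in for the approximation error; all names and sizes are hypothetical. Under these assumptions, a least-squares linear probe `W` fitted from representations to log posteriors should approximately invert `A`, so that `A @ W` is close to the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_concepts = 2000, 4  # hypothetical sizes, chosen for illustration

# Assumed posteriors p(c^i | x) for each concept, kept away from 0 and 1
# so that the logarithm stays finite.
post = rng.uniform(0.05, 0.95, size=(n_samples, n_concepts))
log_post = np.log(post)

# Assumed linear map A (invertible with overwhelming probability) and offset b.
A = 0.5 * rng.normal(size=(n_concepts, n_concepts)) + 3.0 * np.eye(n_concepts)
b = rng.normal(size=n_concepts)

# Synthetic "representations": f(x) = A log p(c|x) + b + small noise.
noise = 1e-3 * rng.normal(size=(n_samples, n_concepts))
reps = log_post @ A.T + b + noise

# Linear probe: least-squares fit from representations (plus bias) to log posteriors.
X = np.hstack([reps, np.ones((n_samples, 1))])
coef, *_ = np.linalg.lstsq(X, log_post, rcond=None)
W = coef[:-1].T  # probe weights, so that log_post ≈ reps @ W.T + bias

# If the linear relation holds, the probe approximately inverts A.
max_dev = np.abs(A @ W - np.eye(n_concepts)).max()
print("max |A W - I| =", max_dev)

# Per-concept correlation between probe outputs and the true log posteriors.
pred = X @ coef
corrs = [np.corrcoef(pred[:, i], log_post[:, i])[0, 1] for i in range(n_concepts)]
print("per-concept correlations:", np.round(corrs, 4))
```

In this toy setting the probe recovers the log posteriors almost exactly because the data were generated by the assumed linear relation; on a real model, how close `A @ W` is to the identity is an empirical question.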
Abstract
The remarkable achievements of large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. This stands in contrast to explanations of their capabilities based on their ability to perform relatively simple manipulations of vast volumes of data. To illuminate the distinction between these explanations, we introduce a novel generative model that generates tokens on the basis of human-interpretable concepts represented as latent discrete variables. Under mild conditions, even when the mapping from the latent space to the observed space is non-invertible, we establish an identifiability result, i.e., the representations learned by LLMs through next-token prediction can be approximately modeled as the logarithm of the posterior probabilities of these latent discrete concepts given the input context, up to an invertible linear transformation. This theoretical finding not only provides evidence that LLMs capture underlying generative factors, but also provides a unified perspective for understanding the linear representation hypothesis. Taking this a step further, our finding motivates a reliable evaluation of sparse autoencoders by treating the performance of supervised concept extractors as an upper bound. Pushing this idea even further, it inspires a structural variant that enforces dependence among latent concepts in addition to promoting sparsity. Empirically, we validate our theoretical results through evaluations on both simulated data and the Pythia, Llama, and DeepSeek model families, and demonstrate the effectiveness of our structured sparse autoencoder.
