How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding
Yuchen Li, Yuanzhi Li, Andrej Risteski
TL;DR
The paper studies, mechanistically, how transformers acquire topic structure. It develops a tractable analysis using data drawn from a topic model (Latent Dirichlet Allocation, LDA) and a one-layer transformer, and proves that topic signals can be encoded either in the token embeddings or in the self-attention layer. It also identifies a two-stage learning dynamic: patterns in the embeddings and the value matrix W^V emerge first, while attention is still uniform, and the attention weights align with topics later. The formal theorems are backed by experiments on synthetic data and Wikipedia, which show block-structured embeddings and W^V as well as topic-biased attention; the authors also discuss robustness to training settings and loss functions. The work offers a principled explanation for topic discovery in contextual representations, and it informs the interpretability and training dynamics of transformers beyond empirical probing alone.
Abstract
While the successes of transformers across many domains are indisputable, accurate understanding of the learning mechanics is still largely lacking. Their capabilities have been probed on benchmarks which include a variety of structured and reasoning tasks -- but mathematical understanding is lagging substantially behind. Recent lines of work have begun studying representational aspects of this question: that is, the size/depth/complexity of attention-based networks to perform certain tasks. However, there is no guarantee the learning dynamics will converge to the constructions proposed. In our paper, we provide fine-grained mechanistic understanding of how transformers learn "semantic structure", understood as capturing co-occurrence structure of words. Precisely, we show, through a combination of mathematical analysis and experiments on Wikipedia data and synthetic data modeled by Latent Dirichlet Allocation (LDA), that the embedding layer and the self-attention layer encode the topical structure. In the former case, this manifests as higher average inner product of embeddings between same-topic words. In the latter, it manifests as higher average pairwise attention between same-topic words. The mathematical results involve several assumptions to make the analysis tractable, which we verify on data, and might be of independent interest as well.
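To make the embedding-layer claim concrete, the following is a minimal sketch of how one might compare the average inner product of same-topic word pairs with that of different-topic pairs. It is not the authors' code: the function name topic_block_stats, the toy vectors, and the word-to-topic mapping are all hypothetical, and the sketch assumes each word belongs to exactly one topic.

```python
from itertools import combinations


def topic_block_stats(vectors, topic_of):
    """Return (same_topic_mean, cross_topic_mean) of pairwise inner products.

    vectors:  dict mapping word -> list of floats (hypothetical embeddings)
    topic_of: dict mapping word -> topic id (assumes one topic per word)
    """
    same, cross = [], []
    # Visit every unordered pair of distinct words exactly once.
    for w1, w2 in combinations(sorted(vectors), 2):
        dot = sum(a * b for a, b in zip(vectors[w1], vectors[w2]))
        if topic_of[w1] == topic_of[w2]:
            same.append(dot)
        else:
            cross.append(dot)
    return sum(same) / len(same), sum(cross) / len(cross)


# Toy data (made up for illustration): two topics, two words each.
vectors = {
    "goal": [1.0, 0.1],
    "league": [0.9, 0.0],
    "stock": [0.1, 1.0],
    "bond": [0.0, 0.8],
}
topic_of = {"goal": 0, "league": 0, "stock": 1, "bond": 1}

same_mean, cross_mean = topic_block_stats(vectors, topic_of)
print(f"same-topic mean:  {same_mean:.4f}")
print(f"cross-topic mean: {cross_mean:.4f}")
```

On this toy input the same-topic mean is larger than the cross-topic mean, which is the kind of gap the abstract describes for learned embeddings. The same averaging could in principle be applied to a matrix of pairwise attention scores in place of inner products, though that variant is not shown here.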
