Why transformers are obviously good models of language
Felix Hill
TL;DR
Transformers are presented as highly plausible models of language because their mechanisms align with cognitive-linguistic theories (e.g., prototype semantics and construction grammar). The paper traces the evolution from word prototypes and distributed lexical representations through contextualised word and sentence representations, highlighting BERT and GPT as frameworks that encode both intrinsic and contextual meaning via large-scale pretraining and prompting. It argues that multi-head attention supports quasi top-down syntactic processing and garden-path-like phenomena without explicit local biases, enabling robust long-range dependency modeling within a context window of about 4,000 tokens. The discussion also notes limitations such as factual inaccuracies and the need for grounding, calling for future work that integrates robust inference with linguistics-informed theories to advance both AI and linguistic science.
Abstract
Nobody knows how language works, but theories abound. Transformers are a class of neural networks that process language automatically with more success than alternatives, both those based on neural computations and those that rely on other (e.g. more symbolic) mechanisms. Here, I highlight direct connections between the transformer architecture and certain theoretical perspectives on language. The empirical success of transformers relative to alternative models provides circumstantial evidence that the linguistic approaches that transformers embody should be, at least, evaluated with greater scrutiny by the linguistics community and, at best, considered the best theories currently available.
