Why transformers are obviously good models of language

Felix Hill

TL;DR

Transformers are presented as highly plausible models of language because their mechanisms align with cognitive-linguistic theories (e.g., prototype semantics and construction grammar). The paper traces the evolution from word prototypes and distributed lexical representations through contextualised word and sentence representations, highlighting BERT and GPT as frameworks that encode both intrinsic and contextual meaning via large-scale pretraining and prompting. It argues that multi-head attention supports quasi top-down syntactic processing and garden-path-like phenomena without explicit local biases, enabling robust long-range dependency modeling within a context window of about 4,000 tokens. The discussion also notes limitations such as factual inaccuracies and the need for grounding, calling for future work that integrates robust inference with linguistics-informed theories to advance both AI and linguistic science.

Abstract

Nobody knows how language works, but many theories abound. Transformers are a class of neural networks that process language automatically with more success than alternatives, both those based on neural computations and those that rely on other (e.g. more symbolic) mechanisms. Here, I highlight direct connections between the transformer architecture and certain theoretical perspectives on language. The empirical success of transformers relative to alternative models provides circumstantial evidence that the linguistic approaches that transformers embody should be, at least, evaluated with greater scrutiny by the linguistics community and, at best, considered to be the currently best available theories.

Paper Structure (6 sections, 4 figures)

This paper contains 6 sections, 4 figures.

Word embeddings and lexical prototypes
Contextualised word meanings
Contextualised sentence representations
Top-down effects, constructions and (whisper it) syntax
Many local dependencies, but no local biases?
Conclusions

Figures (4)

Figure 1: (a) An example of prototype structure for the concept bird, taken from rosch1978principles. A robin is considered by most survey respondents to be a more prototypical bird than a penguin or an ostrich. (b) A 2D (t-SNE) mapping of high-dimensional word embeddings learned by the word2vec algorithm mikolov2013efficient, trained on a large text corpus. Word embeddings can be uncovered in the input weights of any trained neural language model. Here, I propose we consider them as proxies for prototypical meaning. The range of different bird variants depicted in (a) would all be centred conceptually on the word-concept bird in (b). [(b) adapted from https://nlpforhackers.io/word-embeddings]
Figure 2: Early attempts at building on contextualised word representations. (a) miikkulainen1991natural envisaged a 'hidden layer' where the meaning of words in different semantic and syntactic contexts would be pooled. (b) Elman fed his simple recurrent network sentences and clustered the resulting internal state at the point immediately following words of interest. The result was semantic clusters emerging naturally from the syntactic patterns built into his synthetic word-like input sequences.
Figure 3: (Adapted from khalid2021rubert.) A simplified illustration of the extent to which different LM-based systems rely on generalized pre-training vs. task-specific tuning. Circa 2008, pre-training only word (prototype) embeddings (depicted in the darkest shade) was common. Following peters-etal-2018-deep, the computation of meaning in context (illustrated by the next darkest shade of blue) was exploited directly by downstream systems. Shortly after, applying devlin2018bert, which required pretraining an entire transformer network, enabled an even wider class of knowledge transfer from generic text corpora to system-specific tasks. Finally, as of 2019, GPT-style approaches radford2019language render pre-training, in some sense at least, the only game in town. Task-specific behaviour can be achieved in GPT networks with no task-specific weight updates, via a process known as 'prompting'.
Figure 4: As vaswani2017attention show, it is easy to plot attention patterns from trained transformers. However, this alone does not provide us with any nomological understanding of how transformers achieve such effective construction of phrasal or sentential meaning, distributed across their many network components, on such a diverse set of tasks or contexts.
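
To make concrete what an "attention pattern" of the kind mentioned in Figure 4 is, here is a minimal sketch of scaled dot-product attention weights in plain Python. The token vectors are made-up toy values, and using the same vectors as both queries and keys is a simplifying assumption; a trained transformer applies separate learned projections, and nothing here is taken from the paper itself.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Return one row of scaled dot-product attention weights per query."""
    dim = len(keys[0])
    rows = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dimension).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        rows.append(softmax(scores))
    return rows


# Hypothetical 2-d vectors for a three-token input (assumption: toy values,
# with the same vectors reused as queries and keys).
tokens = ["the", "bird", "sang"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(vectors, vectors)

for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row is a probability distribution over the input tokens, which is the quantity an attention heatmap displays. As the caption notes, inspecting such rows shows where a head attends but does not by itself explain how meaning is composed across heads and layers.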