A Philosophical Introduction to Language Models -- Part I: Continuity With Classic Debates

Raphaël Millière; Cameron Buckner

TL;DR

The paper addresses whether large language models instantiate genuine linguistic and cognitive competence or merely mimic sophisticated behavior. It surveys historical foundations and transformer architectures, framing the discussion around core philosophical issues: compositionality, language acquisition, grounding, world representations, and cultural transmission, while warning against misapplied reductive inferences. It finds that LLMs challenge several traditional assumptions and exhibit notable generalization and learning-from-context capabilities, yet robust semantic grounding and stable communicative intentions remain unresolved. The work argues for empirically grounded inquiry into internal representations and world-model-like knowledge, setting the stage for Part II’s empirical probing and new philosophical questions.

Abstract

Large language models like GPT-4 have achieved remarkable proficiency in a broad spectrum of language-based tasks, some of which are traditionally associated with hallmarks of human intelligence. This has prompted ongoing disagreements about the extent to which we can meaningfully ascribe any kind of linguistic or cognitive competence to language models. Such questions have deep philosophical roots, echoing longstanding debates about the status of artificial neural networks as cognitive models. This article -- the first part of two companion papers -- serves both as a primer on language models for philosophers, and as an opinionated survey of their significance in relation to classic debates in the philosophy of cognitive science, artificial intelligence, and linguistics. We cover topics such as compositionality, language acquisition, semantic competence, grounding, world models, and the transmission of cultural knowledge. We argue that the success of language models challenges several long-held assumptions about artificial neural networks. However, we also highlight the need for further empirical investigation to better understand their internal mechanisms. This sets the stage for the companion paper (Part II), which turns to novel empirical methods for probing the inner workings of language models, and new philosophical questions prompted by their latest developments.

Paper Structure (11 sections, 3 figures, 1 table)

Introduction
A primer on LLMs
Historical foundations
Transformer-based LLMs
Interface with classic philosophical issues
Compositionality
Nativism and language acquisition
Language understanding and grounding
World models
Transmission of cultural knowledge and linguistic scaffolding
Conclusion

Figures (3)

Figure 1: An illustration of word embeddings in a multidimensional vector space. A. A word embedding model trained on a natural language corpus learns to encode words into numerical vectors (or embeddings) in a multidimensional space (simplified to two dimensions for visual clarity). Over the course of training, vectors for contextually related words (such as 'age' and 'epoch') become more similar, while vectors for contextually unrelated words (such as 'age' and 'coffee') become less similar. B. Word embeddings in the two-dimensional vector space of a trained model. Words with similar meanings ('age' and 'epoch') are positioned closer together, as indicated by their high cosine similarity score, whereas words with dissimilar meanings ('coffee' and 'epoch') are further apart, reflected in a lower cosine similarity score. Cosine similarity is the cosine of the angle between two non-zero vectors and indicates how similar their directions are. A score closer to 1 indicates a smaller angle and thus a higher degree of similarity between the vectors. Figure loosely adapted from Boleda (2020).
Figure 2: A. The autoregressive Transformer architecture of LLMs. Tokens from the input sequence are first embedded as vectors, which involves converting each token into a high-dimensional space where semantically similar tokens have correspondingly similar vectors. Positional encoding adds information about the position of each token in the input sequence to the embeddings. These enriched embeddings are then processed through successive Transformer blocks. Each block consists of multiple attention heads that process all embeddings in parallel, and a fully-connected feedforward layer, also known as a multilayer perceptron (MLP) layer. Finally, in the unembedding stage, the embeddings undergo a linear transformation to project them into a vocabulary-sized space, producing a set of logits. These logits represent the unnormalized scores for each potential next token in the vocabulary. A softmax layer is then applied to convert the logits into a probability distribution over the vocabulary, indicating the comparative likelihood of each token being the next in the sequence. During training, the correct next token is known and used for backpropagation, whereas during inference, the model predicts the next token without this information. This process can be repeated iteratively in an autoregressive manner, one token prediction at a time, to generate more than one token. B. The attention mechanism visualized. For each token $t_i$, each attention head assigns a weight or attention score to every token $t_0, \dots, t_i$ in the sequence up to and including $t_i$. Here, each red line represents the attention score between 'of' and every other token in the input sequence, including itself. In this example, the attention score quantifies the relevance or importance of each token with respect to the token 'of', with thicker lines indicating higher scores. This pattern exemplifies how the attention mechanism allows the model to dynamically focus on different parts of the input sequence to derive a contextually nuanced representation of each token. The attention pattern is different for every head, because each head specializes during training in selectively attending to specific kinds of dependencies between tokens.
Figure 3: Examples of inputs and outputs from the SCAN dataset (Lake & Baroni, 2018).
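
The computations named in the captions of Figures 1 and 2 can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the paper: the two-dimensional vectors in `toy_embeddings` are invented values, the function names are hypothetical, and the use of scaled dot products between query and key vectors is an assumption based on the standard Transformer formulation (the caption itself only says that each head assigns attention scores). It shows cosine similarity between embeddings, the softmax step that turns logits into a probability distribution, and attention weights restricted to positions up to and including the current token.

```python
import math

# Invented 2-D embeddings for illustration only (not values from the paper).
toy_embeddings = {
    "age": [0.9, 0.8],
    "epoch": [0.8, 0.9],
    "coffee": [-0.7, 0.6],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Turn unnormalized scores (logits) into a probability distribution."""
    largest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(queries, keys):
    """For each position i, return weights over positions 0..i inclusive.

    Assumption: scores are scaled dot products of query and key vectors,
    as in the standard Transformer; the caption does not specify this.
    """
    dim = len(keys[0])
    rows = []
    for i, query in enumerate(queries):
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)  # later positions are never attended to
        ]
        rows.append(softmax(scores))
    return rows
```

For example, `cosine_similarity(toy_embeddings["age"], toy_embeddings["epoch"])` is larger than `cosine_similarity(toy_embeddings["coffee"], toy_embeddings["epoch"])`, mirroring panel B of Figure 1. Calling `causal_attention_weights` on a three-token sequence returns rows of length 1, 2, and 3, each summing to 1, which corresponds to the pattern in panel B of Figure 2 where a token attends only to itself and earlier tokens.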
