Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning
Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer, Alejandro J. Ruiz, Calla Beauregard, Ashley Fehr, Mikaela Irene Fudolig, Bradford Demarest, Yoshi Meke Bird, Milo Z. Trujillo, Christopher M. Danforth, Peter Sheridan Dodds
TL;DR
The paper investigates how tokenization in large language models shapes cognition and meaning, arguing that distributional patterns alone can support minimally viable language (MVP) while token-level structure imposes constraints that mold internal representations. Through three empirical strands—exemplar tokenizations, exemplar vocabularies, and exemplar-token extispicy in RoBERTa—the authors show that tokens surface human-like linguistic units, reveal diverse internal organizations, and expose how tokenization objectives influence cognition and potential biases. They argue for combining distributional foundations with grounding (linguistic and non-linguistic) to deepen meaning, and they caution that current tokenization practices can embed backdoors for bias and harmful content, which alignment alone may not fully remediate. The work advocates revisiting tokenization strategies and architectural assumptions to align model cognition more closely with human language structure and world knowledge, while acknowledging substantial limitations and the need for broader cross-linguistic validation.
Abstract
Tokenization is a necessary component within the current architecture of many language models, including the transformer-based large language models (LLMs) of Generative AI, yet its impact on the model's cognition is often overlooked. We argue that LLMs demonstrate that the Distributional Hypothesis (DH) is sufficient for reasonably human-like language performance (particularly with respect to inferential lexical competence), and that the emergence of human-meaningful linguistic units among tokens and current structural constraints motivate changes to existing, linguistically-agnostic tokenization techniques, particularly with respect to their roles as (1) vehicles for conveying salient distributional patterns from human language to the model and as (2) semantic primitives. We explore tokenizations from a BPE tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken; and the information in exemplar token vectors as they move through the layers of a RoBERTa (large) model. Besides creating suboptimal semantic building blocks and obscuring the model's access to the necessary distributional patterns, we describe how tokens and pretraining can act as a backdoor for bias and other unwanted content, which current alignment practices may not remediate. Additionally, we relay evidence that the tokenization algorithm's objective function impacts the LLM's cognition, despite being arguably meaningfully insulated from the main system intelligence. Finally, we discuss implications for architectural choices, meaning construction, the primacy of language for thought, and LLM cognition. [First uploaded to arXiv in December, 2024.]
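To make the phrase "the tokenization algorithm's objective function" concrete, the following is a minimal sketch of byte-pair encoding (BPE) training in its simplest character-level form. It is not the tokenizer examined in the paper: production BPE tokenizers typically operate on bytes and apply pretokenization rules, and the word counts, the function names (`train_bpe`, `merge_pair`, `tokenize`), and the tie-breaking rule here are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(word_counts, num_merges):
    """Learn an ordered list of merges from a {word: frequency} dict (toy sketch)."""
    # Each word starts as a sequence of single characters.
    corpus = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # The objective: greedily merge the most frequent pair.
        # Ties are broken lexicographically (an arbitrary, assumed choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = {merge_pair(symbols, best): freq for symbols, freq in corpus.items()}
    return merges


def tokenize(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Invented toy counts, for illustration only.
toy_counts = {"walking": 5, "talking": 4, "walked": 3, "talked": 2}
merges = train_bpe(toy_counts, num_merges=4)
print(merges)
print(tokenize("walking", merges))  # ['w', 'alk', 'ing']
```

On this toy corpus, four merges produce both `ing`, which coincides with an English suffix, and `alk`, which does not correspond to any morpheme. The merge rule sees only pair frequencies, so any alignment between tokens and human-meaningful units is a by-product of the statistics rather than something the objective asks for.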
