Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning

Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer, Alejandro J. Ruiz, Calla Beauregard, Ashley Fehr, Mikaela Irene Fudolig, Bradford Demarest, Yoshi Meke Bird, Milo Z. Trujillo, Christopher M. Danforth, Peter Sheridan Dodds

TL;DR

The paper investigates how tokenization in large language models shapes cognition and meaning, arguing that distributional patterns alone can support minimally viable language (MVP) while token-level structure imposes constraints that mold internal representations. Through three empirical strands—exemplar tokenizations, exemplar vocabularies, and exemplar-token extispicy in RoBERTa—the authors show that tokens surface human-like linguistic units, reveal diverse internal organizations, and expose how tokenization objectives influence cognition and potential biases. They argue for combining distributional foundations with grounding (linguistic and non-linguistic) to deepen meaning, and they caution that current tokenization practices can embed backdoors for bias and harmful content, which alignment alone may not fully remediate. The work advocates revisiting tokenization strategies and architectural assumptions to align model cognition more closely with human language structure and world knowledge, while acknowledging substantial limitations and the need for broader cross-linguistic validation.

Abstract

Tokenization is a necessary component within the current architecture of many language models, including the transformer-based large language models (LLMs) of Generative AI, yet its impact on the model's cognition is often overlooked. We argue that LLMs demonstrate that the Distributional Hypothesis (DH) is sufficient for reasonably human-like language performance (particularly with respect to inferential lexical competence), and that the emergence of human-meaningful linguistic units among tokens and current structural constraints motivate changes to existing, linguistically-agnostic tokenization techniques, particularly with respect to their roles as (1) vehicles for conveying salient distributional patterns from human language to the model and as (2) semantic primitives. We explore tokenizations from a BPE tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken; and the information in exemplar token vectors as they move through the layers of a RoBERTa (large) model. Besides creating suboptimal semantic building blocks and obscuring the model's access to the necessary distributional patterns, we describe how tokens and pretraining can act as a backdoor for bias and other unwanted content, which current alignment practices may not remediate. Additionally, we relay evidence that the tokenization algorithm's objective function impacts the LLM's cognition, despite being arguably meaningfully insulated from the main system intelligence. Finally, we discuss implications for architectural choices, meaning construction, the primacy of language for thought, and LLM cognition. [First uploaded to arXiv in December, 2024.]

Paper Structure

This paper contains 16 sections, 42 figures, 11 tables.

Figures (42)

  • Figure 1: This poster, which was displayed at IC2S2 2024 (10th International Conference on Computational Social Science) ic2s22024event, summarizes our previous paper zimmerman2024blind and includes discussion on grounding, salience, and language as a technology. We list the sources in the QR code here: theluddite, craiyon, zimmerman2024blind, mywebsite, chameleonteam2024chameleonmixedmodalearlyfusionfoundation, heikkila2023making, kapoor2024largelanguagemodelstaught, ptosispet, liesenfeld2024rethinking, sutskever2011generating, vaswani2023attention, BolukbasiCZSK16a, 3blue1brown, radhakrishnan2023mechanismfeaturelearningdeep, harris1954structure, lovato2024foregrounding, doddstarot, larousse1970ancient, zimmer2024needunderstandingllmunderstanding, fedorenko2024primarily, templeton2024scaling, levintedtalk, levin2024stigmergy, lee2024tasks, dodds2023ousiometricstelegnomicsessencemeaning, doddscharacterspace, kallini2024mission, mahowald2024dissociatinglanguagethoughtlarge, maudslay-etal-2024-chainnet, futrell2024linguistic, hahn2020universals, OpenAI_ChatGPT, sep-speech-acts, HerbertDuneSeries, 10.1145/3442188.3445922, marconi2003lexical, borges1999collected.
  • Figure 2: Tokens passing through word-like stages as the vocabulary size parameter increases, using a BPE tokenizer and the beginning of Alice in Wonderland. At the top of the figure is a syntax tree for our excerpt from Alice in Wonderland, and the colored paragraph next to it shows how the same excerpt could be tokenized by GPT-4o, underscoring the potential differences in representation between us and LLMs. Syntax tree via treeviewer2023; GPT-4o tokenization via tiktokenizer. The bar charts in the figure show the tokens used in the excerpt according to different tokenization schemas and vocabulary sizes. The height of the bars shows the count of each type of token in the excerpt. The charts on the left side of the figure show linguistically-informed tokenizations (that is, the excerpt as it would be tokenized explicitly according to linguistic/NLP categories, including words and subwords) and the GPT-4o tokenization; note the similarities between the GPT-4o tokenization and the linguistically-informed ones, especially the word-level tokenization. The charts on the right side of the figure show how increasing the vocabulary size causes the average token length to increase, but the average count per token to decrease, as the tokens go from looking like individual letters, to subwords, to words. In the 700-token vocabulary, there are mostly words, but also some phrases ('she had') and bound morphemes ('ion'). The vocabulary size determines what the tokens will look like, and determines whether or not different linguistic types will be present in the vocabulary. For example, phrases and words will never occur in an extremely small vocabulary (given reasonably representative training data). The vocabulary then determines how the model represents all text as vectors.
  • Figure 3: The tokenization (y-axis) of the first sentence from Alice in Wonderland -- "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversations?'" -- by a BPE tokenizer trained on the wikitext dataset and that sentence, as the vocabulary size (x-axis) increases from 2,000 to 30,000, followed by additional, much larger vocabularies (the figure can be explored in a PDF viewer). The tokens at a vocabulary size of 1,000 were individual characters; there were around 3 times as many tokens in the sentence as at a vocabulary size of 2,000, so imagine a tower on the extreme left that is 3 times as tall as the one shown. Likewise, on the extreme right would eventually be the sentence as a single token. We skipped some vocabulary sizes along the x-axis, so it is not evenly spaced (intermediate behaviour between the largest vocabulary sizes was omitted for legibility, but can be reasonably extrapolated by the reader). Middle English translation via openl_middle_english. The short tokens, while combinatorily potent and requiring the model to memorize only a small vocabulary, engender prolific output: the ratio of total tokens per generated text is essentially as high as possible given the source material. On the other hand, the large tokens, while efficient in terms of total tokens per generated text (in the limit, 1:1), are far too specific (the model would need to memorize an essentially infinite vocabulary in order to output general-purpose text). In other words, compared to the original sentence, the different tokenizations show how it gets chopped up as we increase the vocabulary size, revealing a tradeoff between tokens as individual characters, in which case we’d need to encode all meaning in context alone, and tokens as full sentences without generalizable meaning, in which case we’d need an infinitely large vocabulary to produce sensible output.
  • Figure 4: The proportion of clean tokens (out of the total number of clean tokens) that appear in a given number of Hugging Face vocabulary files, showing how swiftly inclusion across files decays; inclusion in files is our proxy for likelihood of tokenization.
  • Figure 5: The proportions of parts of speech among the words in CSW19 and GPT-4o. The last chart shows, for each part-of-speech category, the fraction of CSW19 words in that category that ended up being tokens in GPT-4o's vocabulary.
  • ...and 37 more figures
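
The tradeoff described in the captions for Figures 2 and 3 (more merges give longer, more word-like tokens and fewer tokens per text) can be illustrated with a toy byte-pair-encoding loop. This is only a minimal sketch under stated assumptions, not the tokenizer used in the paper: it trains on a short excerpt rather than wikitext, it pre-splits on whitespace so merges never cross word boundaries, and the names `train_bpe`, `merge_pair`, and `tokenize` are hypothetical helpers introduced here. The number of merges stands in for vocabulary size (roughly, base characters plus merges).

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges from `text`; return them in learned order."""
    # Assumption: whitespace pre-splitting, with each word starting as characters.
    words = Counter(tuple(w) for w in text.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single token
        # Most frequent pair wins; ties are broken lexicographically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges


def tokenize(text, merges):
    """Apply the learned merges, in order, to each whitespace-separated word."""
    tokens = []
    for word in text.split():
        symbols = tuple(word)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens


if __name__ == "__main__":
    excerpt = ("Alice was beginning to get very tired of sitting by her sister "
               "on the bank, and of having nothing to do")
    for n in (0, 10, 40):
        toks = tokenize(excerpt, train_bpe(excerpt, n))
        avg_len = sum(map(len, toks)) / len(toks)
        print(f"merges={n:>2}  tokens={len(toks):>3}  avg token length={avg_len:.2f}")
```

With zero merges every token is a single character; as the merge budget grows, the token count for the same excerpt can only stay the same or shrink while the average token length grows, which is the small-scale analogue of the towers shrinking from left to right in Figure 3.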