Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs

Sheridan Feucht, David Atkinson, Byron Wallace, David Bau

TL;DR

This paper investigates how autoregressive LLMs convert arbitrary groups of tokens into lexical items. It finds that, for multi-token words and named entities, last-token representations show a pronounced erasure of token information in early layers. The paper introduces a read-out framework based on linear probes and an erasure score $\psi_{p,q}$ that quantifies this forgetting across layers, which enables extracting an implicit vocabulary and segmenting documents into high-scoring lexical sequences. Applied to Llama-2-7b and Llama-3-8B, the method recovers lexical items ranging from multi-token words to named entities and code-like expressions, with recall varying across models and datasets. Overall, the work offers a first-step methodology for probing latent lexical representations in LLMs and a concrete tool for identifying which words or expressions a model effectively "knows."

Abstract

LLMs process text as sequences of tokens that roughly correspond to words, where less common words are represented by multiple tokens. However, individual tokens are often semantically unrelated to the meanings of the words/concepts they comprise. For example, Llama-2-7b's tokenizer splits the word "northeastern" into the tokens ['_n', 'ort', 'he', 'astern'], none of which correspond to semantically meaningful units like "north" or "east." Similarly, the overall meanings of named entities like "Neil Young" and multi-word expressions like "break a leg" cannot be directly inferred from their constituent tokens. Mechanistically, how do LLMs convert such arbitrary groups of tokens into useful higher-level representations? In this work, we find that last token representations of named entities and multi-token words exhibit a pronounced "erasure" effect, where information about previous and current tokens is rapidly forgotten in early layers. Using this observation, we propose a method to "read out" the implicit vocabulary of an autoregressive LLM by examining differences in token representations across layers, and present results of this method for Llama-2-7b and Llama-3-8B. To our knowledge, this is the first attempt to probe the implicit vocabulary of an LLM.

Paper Structure (25 sections, 3 equations, 14 figures, 7 tables, 1 algorithm)


Figures (14)

  • Figure 1: We observe "erasure" of token-level information in later layers of LLMs for multi-token words and entities (top). We hypothesize that this is a result of a process that converts token embeddings into useful lexical representations, and introduce a new method for enumerating these lexical items (bottom).
  • Figure 2: Top-1 test accuracy on CounterFact subject last tokens versus other tokens in the dataset for probes trained on Llama-2-7b hidden states ($n=5063$). $i$ represents the position being predicted (e.g., $i=-1$ is previous token prediction; $i=1$ is next-token prediction). We observe an "erasure" effect in last subject tokens that is not present for other types of tokens: these last subject tokens consistently "forget" about preceding tokens and themselves. Appendix \ref{appendix:probes} shows Llama-3-8b results and in-distribution performance on Pile tokens.
  • Figure 3: Top-1 test accuracy of probes on last tokens of Wikipedia multi-token words for Llama-2-7b ($n=80606$). Accuracy on all other tokens shown on the left. We see an erasing effect for multi-token words, similar to the effect seen for CounterFact subjects in Figure \ref{fig:probe-cf}.
  • Figure 4: Overall test accuracy on unseen Pile tokens ($n=273$k) for probes trained on Llama-2-7b hidden states. Next token prediction becomes more accurate throughout model layers as current and previous token accuracy decreases.
  • Figure 5: Full segmentation of a document from Wikipedia via Algorithm \ref{algo} on Llama-2-7b. Borders indicate segmentation, with bolded letters indicating multi-token segments. Darker blue cells have higher scores, yellow cells have negative scores. The highest-scoring sequence in this document is "Australian Institute" ($\psi=0.579$).
  • ...and 9 more figures
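
The offset convention in the Figure 2 caption ($i=-1$ predicts the previous token, $i=1$ the next) can be illustrated with a small sketch. This is not the authors' code: the function names `make_probe_pairs`, `fit_linear_probe`, and `top1_accuracy` are hypothetical, NumPy is an assumed dependency, the hidden states are synthetic, and the ridge least-squares fit is only a stand-in for however the paper's linear probes were actually trained.

```python
import numpy as np

def make_probe_pairs(hidden, token_ids, offset):
    """Pair the hidden state at position t with the token id at position t + offset."""
    hidden = np.asarray(hidden)
    token_ids = np.asarray(token_ids)
    n = len(token_ids)
    # Keep only positions t for which t + offset lies inside the sequence.
    t = np.arange(max(0, -offset), min(n, n - offset))
    return hidden[t], token_ids[t + offset]

def fit_linear_probe(X, y, vocab_size, ridge=1e-3):
    """Fit a linear map by ridge least squares on one-hot targets.

    Assumption: a simple closed-form stand-in, not the paper's training setup.
    """
    Y = np.eye(vocab_size)[y]
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)

def top1_accuracy(W, X, y):
    """Fraction of rows whose highest-scoring class equals the label."""
    return float(np.mean(np.argmax(X @ W, axis=1) == y))

# Synthetic "hidden states" (hypothetical): each position stores a one-hot
# code of the PREVIOUS token, so a probe with offset -1 can read it back.
rng = np.random.default_rng(0)
vocab_size, seq_len = 8, 200
tokens = rng.integers(0, vocab_size, size=seq_len)
hidden = np.zeros((seq_len, vocab_size))
hidden[1:] = np.eye(vocab_size)[tokens[:-1]]

X_prev, y_prev = make_probe_pairs(hidden, tokens, offset=-1)
W_prev = fit_linear_probe(X_prev, y_prev, vocab_size)
acc_prev = top1_accuracy(W_prev, X_prev, y_prev)  # in-sample accuracy
```

With real model activations, one such probe would be fit per layer and per offset and evaluated on held-out tokens; a drop in accuracy for $i \le 0$ at a given position across layers is the kind of signal the captions describe as "erasure."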