Word Recovery in Large Language Models Enables Character-Level Tokenization Robustness

Zhipeng Yang, Shu Yang, Lijie Hu, Di Wang

TL;DR

This work introduces a decoding-based method for detecting word recovery and conducts a fine-grained attention analysis. The analysis shows that in-group attention among characters belonging to the same canonical token is critical for word recovery: masking this attention in early layers substantially reduces both recovery scores and task performance.

Abstract

Large language models (LLMs) trained with canonical tokenization exhibit surprising robustness to non-canonical inputs such as character-level tokenization, yet the mechanisms underlying this robustness remain unclear. We study this phenomenon through mechanistic interpretability and identify a core process we term word recovery. We first introduce a decoding-based method to detect word recovery, showing that hidden states reconstruct canonical word-level token identities from character-level inputs. We then provide causal evidence by removing the corresponding subspace from hidden states, which consistently degrades downstream task performance. Finally, we conduct a fine-grained attention analysis and show that in-group attention among characters belonging to the same canonical token is critical for word recovery: masking such attention in early layers substantially reduces both recovery scores and task performance. Together, our findings provide a mechanistic explanation for tokenization robustness and identify word recovery as a key mechanism enabling LLMs to process character-level inputs.
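
The causal intervention mentioned in the abstract, removing a subspace from hidden states, can be illustrated with a small projection sketch. This is a minimal example under stated assumptions, not the authors' implementation: the abstract does not say how the subspace is constructed, so `directions` below is a hypothetical set of vectors spanning it, and the helper names `orthonormalize` and `remove_subspace` are invented for illustration.

```python
from math import sqrt


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = sqrt(dot(w, w))
        if norm > eps:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def remove_subspace(hidden, directions):
    """Project `hidden` onto the orthogonal complement of span(directions).

    `directions` is a placeholder for whatever vectors define the
    word-recovery subspace (an assumption, not taken from the paper).
    """
    out = list(hidden)
    for u in orthonormalize(directions):
        c = dot(out, u)
        out = [oi - c * ui for oi, ui in zip(out, u)]
    return out


# Toy example: remove the x-y plane from a 3-dimensional hidden state.
hidden_state = [1.0, 2.0, 3.0]
subspace = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
cleaned = remove_subspace(hidden_state, subspace)
```

After the projection, `cleaned` has no component along any vector in `subspace`, while the part of the hidden state outside that subspace is left unchanged. In a real model, an operation of this kind would be applied to hidden states at the chosen layers.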

Paper Structure

This paper contains 17 sections, 11 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Overview of word recovery in large language models under character-level tokenization. A character-level input (e.g., “What is natural gas?”) is processed by the transformer, where early-layer in-group attention aggregates information among characters belonging to the same canonical token. This enables the model to reconstruct word-level representations in hidden states, which are then used for downstream contextual understanding.
  • Figure 2: Layerwise word recovery under character-level tokenization. We show the word recovery score across four datasets and three models. Recovery patterns are consistent across datasets within each model, while the layerwise behavior of recovery varies across models.
  • Figure 3: Layerwise effects of targeted word-recovery intervention for Gemma-2-9B-It. The line plot shows task performance under targeted intervention applied starting from each transformer layer, while the shaded area shows the corresponding word recovery score at the intervention starting layer under character-level tokenization.
  • Figure 4: Layerwise effects of targeted word-recovery intervention for Qwen2.5-7B-Instruct. The line plot shows task performance under targeted intervention applied starting from each transformer layer, while the shaded area shows the corresponding word recovery score at the intervention starting layer under character-level tokenization.
  • Figure 5: Layerwise effects of targeted word-recovery intervention for Llama-3.2-3B-Instruct. The line plot shows task performance under targeted intervention applied starting from each transformer layer, while the shaded area shows the corresponding word recovery score at the intervention starting layer under character-level tokenization.
  • ...and 6 more figures
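
The in-group attention described in the Figure 1 caption, and the masking experiment from the abstract, can be sketched as a boolean attention mask. This is an illustrative sketch under assumptions rather than the paper's procedure: the function names are hypothetical, each character is assumed to be one input position, attention is assumed to be causal, and each position is still allowed to attend to itself so that no row of the mask is empty.

```python
def char_group_ids(canonical_tokens):
    """Assign each character the index of the canonical token it belongs to."""
    ids = []
    for group, token in enumerate(canonical_tokens):
        ids.extend([group] * len(token))
    return ids


def in_group_blocked_mask(group_ids):
    """Causal mask that blocks attention between different positions
    inside the same canonical-token group.

    mask[q][k] is True when query position q may attend to key position k.
    """
    n = len(group_ids)
    mask = []
    for q in range(n):
        row = []
        for k in range(n):
            causal = k <= q
            same_group_other_pos = group_ids[q] == group_ids[k] and q != k
            row.append(causal and not same_group_other_pos)
        mask.append(row)
    return mask


# Toy example: the canonical tokens "What" and " is", split into characters.
groups = char_group_ids(["What", " is"])
mask = in_group_blocked_mask(groups)
```

In this toy case the final character of "What" can attend only to itself, while the "i" in " is" can attend to every character of "What" and to itself, but not to the preceding space from its own group. Applying such a mask only in early layers would correspond to the kind of intervention the abstract describes.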