Tokenization Falling Short: On Subword Robustness in Large Language Models

Yekun Chai; Yewei Fang; Qiwei Peng; Xuhong Li

TL;DR

Scaling model parameters can mitigate tokenization issues; however, LLMs still suffer from biases induced by typos and other text format variations. Experiments show that subword regularization such as BPE-dropout can mitigate these biases.

Abstract

Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary. This process is inherently sensitive to typographical errors and length variations, and it is largely oblivious to the internal structure of tokens. We term these issues the curse of tokenization. In this study, we examine these drawbacks and demonstrate that large language models (LLMs) remain susceptible to them. We systematically investigate their impact on LLMs through three research questions: (1) complex problem solving, (2) token structure probing, and (3) resilience to typographical variation. Our findings reveal that scaling model parameters can mitigate tokenization issues; however, LLMs still suffer from biases induced by typos and other text format variations. Our experiments show that subword regularization such as BPE-dropout can mitigate these biases. We release our evaluation code and data at https://github.com/FloatAI/TKEval.
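To make the idea of subword regularization concrete, here is a minimal sketch of BPE-dropout on a toy vocabulary. It is an illustration under stated assumptions, not the paper's implementation: the function name `bpe_encode`, the merge table, and the dropout values are all hypothetical, and production tokenizers add details (byte-level handling, pre-tokenization) that are omitted here. At each step, every applicable merge is skipped with probability `dropout`, and the highest-priority surviving merge is applied.

```python
import random

def bpe_encode(word, merges, dropout=0.0, rng=None):
    """Toy BPE encoder with optional merge dropout (hypothetical helper)."""
    rng = rng or random.Random()
    # Earlier entries in the merge table have higher priority (lower rank).
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Collect applicable merges; each one is dropped with prob. `dropout`.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break  # nothing left to merge (or every merge was dropped)
        _, i = min(candidates)  # highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Hypothetical merge table, chosen only for illustration.
merges = [("i", "m"), ("p", "o"), ("po", "r"), ("por", "t"), ("im", "port")]

print(bpe_encode("import", merges))               # deterministic BPE
print(bpe_encode("import", merges, dropout=1.0))  # every merge dropped
print(bpe_encode("import", merges, dropout=0.3, rng=random.Random(0)))
```

With `dropout=0.0` the sketch reduces to ordinary greedy BPE, and with `dropout=1.0` it falls back to single characters. Intermediate values yield varying segmentations of the same word, which is the kind of variation a model is exposed to when trained with this form of regularization.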

Paper Structure (58 sections, 11 figures, 3 tables)

This paper contains 58 sections, 11 figures, 3 tables.

Introduction
Contribution
Related Work
Tokenization
Tokenization Approach
Tokenization-Free Approach
Perturbation Probing
Complex Problem Solving
Anagram Task
Task Description and Settings
Results and Analysis
Mathematical Language (LaTeX) Comprehension
Results and Analysis
Token Structure Probe
Intra-Token Probing
...and 43 more sections

Figures (11)

Figure 1: Compositional challenges in token embeddings. (a) "assignment" decomposed into "assign" and "ment" shows a cosine similarity of 0.21 and an angle of 78.16°. (b) "import" decomposed into "im" and "port" shows a cosine similarity of 0.13 and an angle of 82.47°. These results indicate that existing LLMs do not accurately capture surface form composition.
Figure 2: $K$-shot performance on Word Unscrambling (WU) and Cycled Letters (CL) tasks. The plots illustrate that increasing the number of demonstration examples ($K$-shot) does not consistently enhance performance. However, models with larger parameter sizes generally exhibit better performance across both tasks.
Figure 3: The relationship between the length of scrambled words and the Exact Match (EM) score of Llama3-8B and Llama3-70B on the word unscrambling task under one-shot evaluation. The models tend to correctly reorder anagrams of shorter lengths, while struggling with longer words.
Figure 4: $K$-shot performance on intra-token probing tasks (CCV, CC, NC, NCR). The plots demonstrate that increasing the number of demonstration examples ($K$-shot) generally results in an improvement from zero-shot to one-shot, with performance stabilizing thereafter.
Figure 5: $K$-shot performance on various inter-token probing tasks. For edit distances, lower is better.
...and 6 more figures
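The Figure 1 caption reports each comparison both as a cosine similarity and as an angle. The two are related by the inverse cosine, which the sketch below makes explicit. The helper names are hypothetical, and the caption does not say which embedding vectors are being compared, so the example only checks the reported numbers against each other rather than reproducing them from a model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angle_degrees(cos_sim):
    """Angle in degrees corresponding to a cosine similarity."""
    # Clamp to [-1, 1] to guard against floating-point drift.
    clamped = max(-1.0, min(1.0, cos_sim))
    return math.degrees(math.acos(clamped))

# Rounded cosine similarities taken from the Figure 1 caption.
for label, cos_sim in [("assign + ment", 0.21), ("im + port", 0.13)]:
    print(f"{label}: cos={cos_sim:.2f}, angle={angle_degrees(cos_sim):.2f} deg")
```

Converting the rounded cosines gives roughly 77.9° and 82.5°, close to the reported 78.16° and 82.47°; the small gaps are consistent with the cosine values having been rounded to two decimals. Both angles are near 90°, meaning the compared vectors are close to orthogonal.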
