Glitch Tokens in Large Language Models: Categorization Taxonomy and Effective Detection

Yuxi Li; Yi Liu; Gelei Deng; Ying Zhang; Wenjia Song; Ling Shi; Kailong Wang; Yuekang Li; Yang Liu; Haoyu Wang

TL;DR

This paper introduces glitch tokens—anomalous tokens produced by tokenizers that disrupt LLM outputs—and presents a large-scale empirical study across seven popular LLMs and three tokenizers. It proposes a five-type glitch-token taxonomy and five symptom categories, supported by 7,895 identified glitch tokens and real-world prevalence in major datasets. The authors then introduce GlitchHunter, an iterative, embedding-space clustering detector built on a Token Embedding Graph and powered by the Leiden algorithm, achieving up to 100% precision and significantly reducing search cost versus full vocabulary traversal. They validate GlitchHunter on eight open-source LLMs, showing substantial efficiency gains (roughly 72–80% reduction in resource usage) and strong effectiveness (average precision 99.44% and recall 63.20%), outperforming three baselines. The work provides practical methods for detecting and mitigating tokenization-related glitches, with implications for improving the robustness and trustworthiness of LLM deployments in real-world settings.

Abstract

With the expanding application of Large Language Models (LLMs) in various domains, it becomes imperative to comprehensively investigate their unforeseen behaviors and consequent outcomes. In this study, we introduce and systematically explore the phenomenon of "glitch tokens", which are anomalous tokens produced by established tokenizers that can compromise the quality of the models' responses. Specifically, we experiment on seven popular LLMs utilizing three distinct tokenizers and involving a total of 182,517 tokens. We present categorizations of the identified glitch tokens and of the symptoms exhibited by LLMs when interacting with them. Based on our observation that glitch tokens tend to cluster in the embedding space, we propose GlitchHunter, a novel iterative clustering-based technique for efficient glitch token detection. The evaluation shows that our approach notably outperforms three baseline methods on eight open-source LLMs. To the best of our knowledge, we present the first comprehensive study on glitch tokens. Our new detection technique further provides valuable insights into mitigating tokenization-related errors in LLMs.
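
The clustering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' GlitchHunter implementation: the similarity-threshold graph stands in for the Token Embedding Graph, connected components stand in for Leiden communities, and `is_glitch` is a hypothetical oracle that would, in practice, query the model. All function names, parameters, and the toy embeddings are assumptions made for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_graph(embeddings, threshold):
    """Link tokens whose embeddings are at least `threshold` similar."""
    tokens = list(embeddings)
    graph = {t: set() for t in tokens}
    for i, a in enumerate(tokens):
        for b in tokens[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def communities(graph):
    """Connected components: a simple stand-in for Leiden communities."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(group)
    return groups


def hunt(embeddings, is_glitch, threshold=0.9, step=0.02,
         sample_size=2, min_rate=0.5, max_rounds=5, seed=0):
    """Iteratively keep clusters whose sampled tokens are mostly glitches."""
    rng = random.Random(seed)
    candidates = dict(embeddings)
    for _ in range(max_rounds):
        kept = {}
        for group in communities(build_graph(candidates, threshold)):
            # Test only a small sample per cluster instead of every token.
            sample = rng.sample(group, min(sample_size, len(group)))
            rate = sum(is_glitch(t) for t in sample) / len(sample)
            if rate >= min_rate:
                kept.update((t, candidates[t]) for t in group)
        if len(kept) == len(candidates):
            break  # no cluster was discarded, so the candidate set is stable
        candidates = kept
        threshold = min(threshold + step, 0.999)  # refine clusters next round
    # Final pass: verify every surviving candidate individually.
    return sorted(t for t in candidates if is_glitch(t))


# Toy 2-D embeddings (hypothetical): g* lie close together, n* do not.
toy_embeddings = {
    "g1": (1.00, 0.02), "g2": (0.98, 0.05), "g3": (1.00, -0.03),
    "n1": (0.00, 1.00), "n2": (0.05, 0.99), "n3": (-1.00, 0.10),
}
toy_glitches = {"g1", "g2", "g3"}
found = hunt(toy_embeddings, lambda t: t in toy_glitches)
print(found)
```

On this toy input the sketch returns the three clustered tokens. Because only a sample of each cluster is tested before the cluster is kept or discarded, the oracle sees far fewer tokens than a full vocabulary traversal would require, which is the efficiency argument the abstract makes.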

Paper Structure (30 sections, 8 equations, 5 figures, 8 tables, 1 algorithm)

Introduction
Background
Token and Tokenization in LLMs
Glitch Token
Motivating Example
Empirical Study Methodology
Dataset Collection
Data Labelling
Empirical Study Result
RQ1 (Symptom): What are the unexpected behaviors caused by glitch tokens in LLMs?
RQ2 (Glitch Token Type): What are the common types of glitch tokens in LLMs?
RQ3 (Real-world Analysis): What is the frequency of glitch tokens in real-world datasets?
Implications of Our Findings
Efficient Glitch Token Detection (RQ4)
Initial TEG Building
...and 15 more sections

Figures (5)

Figure 1: Workflow of A Typical Language Model Based on A Normal Tokenizer. The process starts with an input sentence, "Jack is a boy, Jane is a," which is fed into the tokenizer. This tokenizer breaks down the input into smaller chunks or tokens, as represented by the "Tokenize" stage. The tokenized input is then embedded, transforming the tokens into vectors suitable for the language model. The embedded input is processed by the language model, which generates a set of probabilities for potential next words or tokens. The "Decode" stage then interprets these probabilities to produce the final output, in this case, the word "girl." The overall output completes the sentence as "Jack is a boy, Jane is a girl." The entire process is visualized with arrows and labeled boxes, highlighting the flow from input to output.
Figure 2: A Motivating Example on Token " TheNitrome"
Figure 3: Venn Graph of Different Tokenizers
Figure 4: UMAP Visualization of the Llama2-7b-chat token set. Letters A-E denote the five glitch categories from Table \ref{tab:glitch_type}; 'Normal' labels non-glitch tokens. Dashed lines outline glitch token clustering.
Figure 5: Overall Workflow of GlitchHunter
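
The tokenize, embed, predict, and decode stages described in the Figure 1 caption can be walked through with a toy sketch. Everything here is hypothetical: the seven-entry vocabulary, the 2-D embeddings, and the recency-weighted scoring rule are hand-written so that the example is self-contained, and they do not correspond to any real tokenizer or model.

```python
import math

VOCAB = ["Jack", "is", "a", "boy", ",", "Jane", "girl"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}

# Hypothetical 2-D input embeddings; tokens not listed map to the zero vector.
INPUT_EMBEDDINGS = {
    "Jack": (1.0, 0.0), "boy": (1.0, 0.0),
    "Jane": (0.0, 1.0), "girl": (0.0, 1.0),
}
# Hypothetical output vectors: only the two candidate nouns can score above 0.
OUTPUT_VECTORS = {"boy": (1.0, 0.0), "girl": (0.0, 1.0)}


def tokenize(text):
    """Split on whitespace and detach commas (real tokenizers use subwords)."""
    return text.replace(",", " ,").split()


def encode(tokens):
    """Map each token to its vocabulary index."""
    return [TOKEN_TO_ID[token] for token in tokens]


def embed(ids):
    """Look up one vector per token id."""
    return [INPUT_EMBEDDINGS.get(VOCAB[i], (0.0, 0.0)) for i in ids]


def next_token_probs(vectors):
    """Toy 'model': recency-weighted context vector, then softmax over VOCAB."""
    context = [0.0, 0.0]
    for position, vec in enumerate(vectors, start=1):
        context[0] += position * vec[0]
        context[1] += position * vec[1]
    logits = []
    for token in VOCAB:
        out = OUTPUT_VECTORS.get(token, (0.0, 0.0))
        logits.append(context[0] * out[0] + context[1] * out[1])
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(probs):
    """Greedy decoding: pick the most probable vocabulary entry."""
    return VOCAB[max(range(len(probs)), key=probs.__getitem__)]


def complete(text):
    """Run the full pipeline and append the predicted token."""
    probs = next_token_probs(embed(encode(tokenize(text))))
    return f"{text} {decode(probs)}"


completed = complete("Jack is a boy, Jane is a")
print(completed)
```

With these hand-picked numbers the most recent name dominates the context vector, so greedy decoding selects "girl", matching the completion described in the caption.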
