Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers

Eugene Jang, Kimin Lee, Jin-Woo Chung, Keuntae Park, Seungwon Shin

TL;DR

This work reveals a vulnerability in byte-level BPE tokenizers: incomplete tokens, which rely heavily on neighboring tokens, can be adversarially exploited via improbable bigrams to induce hallucinations in several instruction-tuned LLMs. The authors construct such bigrams and show they cause significantly more hallucinations than baselines with complete tokens, even for well-trained constituents. They demonstrate that alternative tokenization, such as presegmenting the target phrases, markedly reduces these hallucinations, suggesting the root cause lies in the incomplete-token structure. The findings imply practical mitigations—such as vocabulary pruning, enforcing character-boundary-aware merges, or adopting character-level tokenization—to reduce tokenization-induced blind spots and improve model robustness in real-world deployments.

Abstract

Tokenization is a crucial step that bridges human-readable text with model-readable discrete tokens. However, recent studies have revealed that tokenizers can be exploited to elicit unwanted model behaviors. In this work, we investigate incomplete tokens, i.e., undecodable tokens with stray bytes resulting from byte-level byte-pair encoding (BPE) tokenization. We hypothesize that such tokens are heavily reliant on their adjacent tokens and are fragile when paired with unfamiliar tokens. To demonstrate this vulnerability, we introduce improbable bigrams: out-of-distribution combinations of incomplete tokens designed to exploit their dependency. Our experiments show that improbable bigrams are significantly prone to hallucinatory behaviors. Surprisingly, the same phrases have drastically lower rates of hallucination (90% reduction in Llama3.1) when an alternative tokenization is used. We caution against the potential vulnerabilities introduced by byte-level BPE tokenizers, which may introduce blind spots to language models.
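To make the notion of an incomplete token concrete, here is a minimal sketch in plain Python. It assumes nothing about any particular tokenizer: the byte strings `left` and `right` are hypothetical stand-ins for token contents, produced by cutting a UTF-8 string in the middle of a character. Each fragment fails to decode on its own, but the two decode cleanly once concatenated.

```python
def is_incomplete(token_bytes: bytes) -> bool:
    """Return True if the bytes do not decode as UTF-8 on their own."""
    try:
        token_bytes.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


# Two Korean characters, each 3 bytes long in UTF-8 (6 bytes in total).
text = "안녕"
raw = text.encode("utf-8")

# Hypothetical split that crosses a character boundary, for illustration only:
# `left` ends with a lead byte whose continuation bytes are missing,
# `right` begins with the stray continuation bytes.
left, right = raw[:4], raw[4:]

print(is_incomplete(left))          # left fragment alone
print(is_incomplete(right))         # right fragment alone
print(is_incomplete(left + right))  # the two fragments together
```

A byte-level BPE vocabulary can contain tokens shaped like `left` and `right` because merges operate on bytes and are not required to respect character boundaries.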

Paper Structure

This paper contains 9 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: An improbable bigram phrase that combines two incomplete tokens to cause hallucinatory behaviors in the Qwen2.5 model. This behavior persists across multiple models and with well-trained tokens. An alternative tokenization of the same phrase does not cause hallucinations.
  • Figure 2: Incomplete tokens are analyzed for their structure to find potential bigram candidates. C-bytes (continuation bytes) are colored in blue. Tokens with compatible structures are tested for viability and reachability through a decode-encode check.
  • Figure 3: Language-pair distributions of the improbable bigrams used in experiments. Darkened colors indicate improbable bigrams that cause hallucinations.
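
The structure analysis described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative reading of the caption, not the authors' implementation: the function names are hypothetical, the compatibility rule (missing continuation bytes on the left must equal stray continuation bytes on the right) is an assumption based on how UTF-8 is laid out, and `decode` and `encode` are placeholder callables standing in for a real tokenizer.

```python
def leading_continuation_bytes(b: bytes) -> int:
    """Count C-bytes (bit pattern 10xxxxxx) at the start of the byte string."""
    count = 0
    for byte in b:
        if (byte & 0xC0) == 0x80:
            count += 1
        else:
            break
    return count


def missing_trailing_bytes(b: bytes) -> int:
    """Count continuation bytes still owed to the last lead byte."""
    trailing = 0
    for byte in reversed(b):
        if (byte & 0xC0) == 0x80:
            trailing += 1          # continuation byte already present
            continue
        if (byte >> 5) == 0b110:
            needed = 1             # lead byte of a 2-byte character
        elif (byte >> 4) == 0b1110:
            needed = 2             # lead byte of a 3-byte character
        elif (byte >> 3) == 0b11110:
            needed = 3             # lead byte of a 4-byte character
        else:
            needed = 0             # ASCII byte, nothing owed
        return max(needed - trailing, 0)
    return 0                       # no lead byte found at all


def structurally_compatible(left: bytes, right: bytes) -> bool:
    """Assumed rule: the right token supplies exactly the bytes the left lacks."""
    needed = missing_trailing_bytes(left)
    return needed > 0 and needed == leading_continuation_bytes(right)


def round_trip_ok(ids, decode, encode) -> bool:
    """Assumed decode-encode check: the decoded string re-encodes to the same ids."""
    return list(encode(decode(ids))) == list(ids)


# Hypothetical token contents, for illustration only.
raw = "안녕".encode("utf-8")
left, right = raw[:4], raw[4:]
print(structurally_compatible(left, right))
```

With a real tokenizer, `decode` and `encode` would be its own methods, and only candidate pairs that pass both the structural test and the round trip would be kept.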