Problematic Tokens: Tokenizer Bias in Large Language Models

Jin Yang, Zhiqiang Wang, Yanbin Lin, Zunduo Zhao

TL;DR

The tokenizer’s vocabulary, which speeds up tokenization and reduces token counts but is constructed independently of the actual model training data, inadequately represents non-English languages. The resulting under-trained or untrained tokens perpetuate biases and pose serious concerns related to data security and ethical standards.

Abstract

Recent advancements in large language models (LLMs), such as GPT-4 and GPT-4o, have shown exceptional performance, especially in languages with abundant resources like English, thanks to extensive datasets that ensure robust training. Conversely, these models exhibit limitations when processing under-resourced languages such as Chinese and Korean, where issues including hallucinatory responses remain prevalent. This paper traces the roots of these disparities to the tokenization process inherent to these models. Specifically, it explores how the tokenizer's vocabulary, which is used to speed up tokenization and reduce token counts but is constructed independently of the actual model training data, inadequately represents non-English languages. This misrepresentation results in the propagation of under-trained or untrained tokens, which perpetuate biases and pose serious concerns related to data security and ethical standards. We aim to dissect the tokenization mechanics of GPT-4o, illustrating how its simplified token-handling methods amplify these risks, and to offer strategic solutions to mitigate the associated security and ethical issues. Through this study, we emphasize the critical need to rethink tokenization frameworks to foster more equitable and secure AI technologies. The code and data are available at https://github.com/yeyimilk/LLMGPT4o

Paper Structure

This paper contains 24 sections, 1 equation, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: GPT-4 and GPT-4o use distinct tokenizers, so the same Chinese text is split into different tokens. The English text is a human translation provided for reference.
  • Figure 2: Motivating example. GPT-4o was not able to understand a common Chinese phrase.
  • Figure 3: Token samples from the GPT-4o tokenizer, o200k_base, and their classification based on their content.
  • Figure 4: General workflow for determining problematic tokens.
  • Figure 5: Sample sentences generated by GPT-4 and GPT-4o using a long Chinese token from tiktoken's o200k_base (with the leading space removed) and its corresponding shorter tokens, along with human-assigned scores. The English text is a human translation provided for reference.
  • ...and 1 more figure