Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs

Aaditya K. Singh, DJ Strouse

TL;DR

This work shows that number tokenization direction (L2R vs R2L) materially shapes arithmetic performance in frontier LLMs, with R2L (enforced via right-aligned digit grouping) yielding substantial gains over L2R. Through controlled delimited inputs and multiple ablations, the authors show that the observed advantages are driven primarily by tokenization direction rather than by delimiter semantics or added thinking tokens. Detailed error analyses uncover systematic length-mismatch and token-boundary failures, including a pronounced digit-4 misprediction pattern under L2R. The study also demonstrates that models can be prompted to convert L2R inputs into R2L outputs, which recovers performance and suggests avenues for mitigating tokenization biases in numerical reasoning. Collectively, the results urge careful ablations of number tokenization in model training and evaluation to better understand and control inductive biases in arithmetic tasks.

Abstract

Tokenization, the division of input text into input tokens, is an often overlooked aspect of the large language model (LLM) pipeline and could be the source of useful or harmful inductive biases. Historically, LLMs have relied on byte pair encoding, without regard to specific input domains. With the increased use of LLMs for reasoning, various number-specific tokenization schemes have been adopted, with popular models like LLaMa and PaLM opting for single-digit tokenization while GPT-3.5 and GPT-4 have a separate token for each 1-, 2-, and 3-digit number. In this work, we study the effect this choice has on numerical reasoning through the use of arithmetic tasks. We consider left-to-right and right-to-left tokenization for GPT-3.5 and -4, finding that right-to-left tokenization (enforced by comma-separating numbers at inference time) leads to substantially improved performance. Furthermore, we find that model errors when using standard left-to-right tokenization follow stereotyped error patterns, suggesting that model computations are systematic rather than approximate. We show that the model is able to convert between tokenizations easily, thus allowing chain-of-thought-inspired approaches to recover performance on left-to-right tokenized inputs. We also find that the gap between tokenization directions decreases when models are scaled, possibly indicating that larger models are better able to override this tokenization-dependent inductive bias. In summary, our work performs the first study of how number tokenization choices lead to differences in model performance on arithmetic tasks, accompanied by a thorough analysis of error patterns. We hope this work inspires practitioners to more carefully ablate number tokenization-related choices when working towards general models of numerical reasoning.
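
To make the two tokenization directions concrete, the following is a minimal Python sketch that mimics 3-digit chunking from the left (L2R) and from the right (R2L, as enforced by comma-separating). It does not call the real tokenizer; the function names (`chunk_l2r`, `chunk_r2l`, `with_delimiter`) and the example addends are illustrative assumptions, not taken from the paper.

```python
def chunk_l2r(digits: str, size: int = 3) -> list[str]:
    """Split a digit string into chunks of `size`, starting from the left."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def chunk_r2l(digits: str, size: int = 3) -> list[str]:
    """Split a digit string into chunks of `size`, starting from the right."""
    head = len(digits) % size  # length of the short leading chunk, if any
    chunks = [digits[:head]] if head else []
    chunks += [digits[i:i + size] for i in range(head, len(digits), size)]
    return chunks


def with_delimiter(digits: str, delimiter: str = ",") -> str:
    """Insert a delimiter so that chunk boundaries are right-aligned."""
    return delimiter.join(chunk_r2l(digits))


# Hypothetical 7-digit addends whose sum has 8 digits.
a, b = "8302080", "3529456"
total = str(int(a) + int(b))  # "11831536"

for name, value in [("a", a), ("b", b), ("a+b", total)]:
    print(name, "L2R:", chunk_l2r(value), "R2L:", chunk_r2l(value))

print(with_delimiter(a), "+", with_delimiter(b), "=", with_delimiter(total))
# 8,302,080 + 3,529,456 = 11,831,536
```

In this toy case the L2R chunks of the answer (`118`, `315`, `36`) do not line up with the L2R chunks of the addends (`830`, `208`, `0`), because the answer is one digit longer. The R2L chunks stay aligned from the least significant digits upward, which is the alignment the abstract refers to.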

Paper Structure (27 sections, 1 equation, 18 figures, 1 table)

Figures (18)

  • Figure 1: Illustrating the dependence of frontier model arithmetic performance on tokenization. We show how using commas can enforce right-to-left (R2L) tokenization for the same addition problem. R2L tokenization leads to improved model performance on both GPT-3.5 and GPT-4 (March 2023 models), which we show is due to tokenization alignment between addends and answer through various controls and error analyses.
  • Figure 2: All 3-digit strings, colored red when the string does not have a corresponding single token in p50k_base, the BPE tokenizer for GPT-3. Though there are some patterns (e.g., nearly all multiples of 10 are present), overall there is no clear structure. The missing tokens are an artifact of the specific process BPE tokenizers use to establish vocabularies.
  • Figure 3: Comparison of how p50k_base, the tokenizer for GPT-3, and cl100k_base, the tokenizer for GPT-3.5 and GPT-4, segment 4-digit strings into tokens. cl100k_base standardized number tokenization to chunks of 3 digits, left-to-right, resulting in all N-digit numbers being segmented the same way.
  • Figure 4: Effect of R2L vs L2R tokenization with increasing shots.
  • Figure 5: 8-shot accuracy when using different delimiters for R2L tokenization. Dotted lines show results from the main result figure for comparison. Overall, we see that the choice of delimiter matters less than the direction of tokenization.
  • ...and 13 more figures