Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs
Aaditya K. Singh, DJ Strouse
TL;DR
This work shows that number tokenization direction, left-to-right (L2R) versus right-to-left (R2L), materially shapes arithmetic performance in frontier LLMs, with R2L (enforced via right-aligned digit grouping) yielding substantial gains over L2R. Through controlled delimited inputs and multiple ablations, the authors show that the observed advantages are primarily driven by tokenization direction rather than delimiter semantics or added thinking tokens. Detailed error analyses uncover systematic length-mismatch and token-boundary failures, including a pronounced digit-4 misprediction pattern under L2R. The study also demonstrates that models can be prompted to convert L2R inputs into R2L outputs, recovering performance and suggesting avenues for mitigating tokenization biases in numerical reasoning. Collectively, the results urge careful ablations of number tokenization in model training and evaluation to better understand and control inductive biases in arithmetic tasks.
Abstract
Tokenization, the division of input text into input tokens, is an often overlooked aspect of the large language model (LLM) pipeline and could be the source of useful or harmful inductive biases. Historically, LLMs have relied on byte pair encoding, without regard to specific input domains. With the increased use of LLMs for reasoning, various number-specific tokenization schemes have been adopted, with popular models like LLaMa and PaLM opting for single-digit tokenization while GPT-3.5 and GPT-4 have separate tokens for each 1-, 2-, and 3-digit number. In this work, we study the effect this choice has on numerical reasoning through the use of arithmetic tasks. We consider left-to-right and right-to-left tokenization for GPT-3.5 and GPT-4, finding that right-to-left tokenization (enforced by comma-separating numbers at inference time) leads to substantially improved performance. Furthermore, we find that model errors when using standard left-to-right tokenization follow stereotyped error patterns, suggesting that model computations are systematic rather than approximate. We show that the model is able to convert between tokenizations easily, thus allowing chain-of-thought-inspired approaches to recover performance on left-to-right tokenized inputs. We also find that the gap between tokenization directions decreases when models are scaled, possibly indicating that larger models are better able to override this tokenization-dependent inductive bias. In summary, our work performs the first study of how number tokenization choices lead to differences in model performance on arithmetic tasks, accompanied by a thorough analysis of error patterns. We hope this work inspires practitioners to more carefully ablate number tokenization-related choices when working towards general models of numerical reasoning.
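To make the two tokenization directions concrete, the following minimal sketch groups a digit string into chunks of up to three digits from either end. It is an illustration only: it does not call any real tokenizer, the function names (`chunk_l2r`, `chunk_r2l`, `comma_separate`) are hypothetical, and it assumes that an undelimited number is split greedily from the left while commas force a right-aligned split.

```python
def chunk_l2r(digits: str, size: int = 3) -> list[str]:
    """Group digits greedily from the left (assumed default behavior)."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def chunk_r2l(digits: str, size: int = 3) -> list[str]:
    """Group digits from the right, so the chunks align with place value."""
    head = len(digits) % size  # length of the short leading chunk, if any
    chunks = [digits[:head]] if head else []
    chunks += [digits[i:i + size] for i in range(head, len(digits), size)]
    return chunks


def comma_separate(digits: str) -> str:
    """Insert commas at the right-aligned chunk boundaries."""
    return ",".join(chunk_r2l(digits))


if __name__ == "__main__":
    number = "1234567"
    print(chunk_l2r(number))       # left-aligned chunks
    print(chunk_r2l(number))       # right-aligned chunks
    print(comma_separate(number))  # comma-delimited form of the same number
```

Under these assumptions, the left-aligned grouping places chunk boundaries at positions that shift with the length of the number, whereas the right-aligned grouping keeps the ones, thousands, and millions chunks in fixed positions. This alignment with place value is one plausible reading of why the comma-separated format helps.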
