Length-MAX Tokenizer for Language Models
Dong Dong, Weijie Su
TL;DR
Length-MAX introduces a length-weighted tokenizer that maximizes $\mathrm{freq}(t) \cdot |t|$ to reduce tokens per character and improve efficiency. It formulates vocabulary construction as an NP-hard graph partitioning problem and solves it approximately with an $\mathcal{O}(N)$ greedy, scoreboard-based algorithm. The approach yields substantial end-to-end gains in GPT-2-scale training: 14–18% fewer tokens than BPE, up to 18.5% fewer steps to reach a fixed validation loss, up to 13.7% lower inference latency, and 16% higher throughput at 124M parameters. It preserves or improves downstream results on tasks such as LAMBADA and HellaSwag and achieves 99.62% vocabulary coverage with a 0.12% out-of-vocabulary rate. A production-ready pipeline uses Rabin-Karp enumeration, DFA-based decoding, and near-linear CPU scalability, enabling cross-domain efficiency with minimal architectural changes. The work also documents Zipf-aligned token distributions and analyzes robustness, memory footprint reductions, and scaling behavior, suggesting Length-MAX as a complementary, production-friendly tokenization paradigm that can coexist with boundary-aware or token-free approaches.
Abstract
We introduce a new tokenizer for language models that minimizes the average tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we refer to as the Length-MAX tokenizer, obtains its vocabulary by casting the maximization of a length-weighted objective as a graph partitioning problem and developing a greedy approximation algorithm. On FineWeb and diverse domains, it yields 14--18\% fewer tokens than Byte Pair Encoding (BPE) across vocabulary sizes from 10K to 50K, and 13.0\% fewer at a vocabulary size of 64K. Training GPT-2 models at 124M, 355M, and 1.3B parameters from scratch with five runs each shows 18.5\%, 17.2\%, and 18.5\% fewer steps, respectively, to reach a fixed validation loss, and 13.7\%, 12.7\%, and 13.7\% lower inference latency, together with a 16\% throughput gain at 124M. Downstream results improve consistently, including an 11.7\% reduction in LAMBADA perplexity and a 4.3\% gain in HellaSwag accuracy. Moreover, the Length-MAX tokenizer achieves 99.62\% vocabulary coverage, and the out-of-vocabulary rate remains low at 0.12\% on test sets. These results demonstrate that optimizing for average token length, rather than frequency alone, offers an effective approach to more efficient language modeling without sacrificing, and often while improving, downstream performance. The tokenizer is compatible with production systems and reduces embedding and KV-cache memory by 18\% at inference.
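To make the length-weighted objective and the tokens-per-character metric concrete, the following is a minimal Python sketch, not the authors' algorithm. It scores candidate substrings by $\mathrm{freq}(t) \cdot |t|$ using brute-force enumeration (the graph partitioning formulation, greedy scoreboard, and Rabin-Karp enumeration are not reproduced), and it measures tokens per character with a simple greedy longest-match tokenizer. The function names `length_weighted_scores`, `greedy_longest_match`, and `tokens_per_character`, and the parameters `min_len` and `max_len`, are hypothetical and chosen only for illustration.

```python
from collections import Counter


def length_weighted_scores(corpus, min_len=2, max_len=8):
    """Score each substring t by freq(t) * len(t).

    Brute-force enumeration for illustration only; this is an assumed
    stand-in for the paper's candidate enumeration, not a reproduction.
    """
    counts = Counter()
    for text in corpus:
        n = len(text)
        for i in range(n):
            # Count every substring starting at i within the length bounds.
            for length in range(min_len, min(max_len, n - i) + 1):
                counts[text[i:i + length]] += 1
    return {t: c * len(t) for t, c in counts.items()}


def greedy_longest_match(text, vocab):
    """Tokenize by longest vocabulary match, falling back to single characters."""
    max_len = max(map(len, vocab), default=1)
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            # A single character is always accepted so the loop terminates.
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


def tokens_per_character(tokens, text):
    """Average number of tokens used per character of text (lower is better)."""
    return len(tokens) / len(text)


if __name__ == "__main__":
    corpus = ["the cat the cat"]
    scores = length_weighted_scores(corpus, max_len=7)
    best = max(scores, key=scores.get)
    tokens = greedy_longest_match(corpus[0], {best})
    print(best, scores[best], tokens_per_character(tokens, corpus[0]))
```

In this toy corpus the longer repeated string outscores its shorter, equally frequent pieces, which is the intuition behind weighting frequency by length rather than ranking by frequency alone.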
