Length-MAX Tokenizer for Language Models

Dong Dong, Weijie Su

TL;DR

Length-MAX introduces a length-weighted tokenizer that maximizes $\mathrm{freq}(t) \cdot |t|$ to reduce tokens per character and improve efficiency. It formulates vocabulary construction as an NP-hard graph partitioning problem and solves it approximately with an $\mathcal{O}(N)$ greedy, scoreboard-based algorithm. The approach yields substantial end-to-end gains for GPT-2-scale models: 14–18% fewer tokens, 18.5% fewer training steps to reach a fixed validation loss, 13.7% lower inference latency, and 16% higher throughput, while preserving or improving downstream results on tasks such as LAMBADA and HellaSwag and achieving 99.62% vocabulary coverage with a low out-of-vocabulary (OOV) rate. A production-ready pipeline uses Rabin-Karp enumeration, DFA-based decoding, and near-linear CPU scalability, enabling cross-domain efficiency with minimal architectural changes. The work also documents Zipf-aligned token distributions and analyzes robustness, memory footprint reductions, and scaling behavior, suggesting Length-MAX as a complementary, production-friendly tokenization paradigm that can coexist with boundary-aware or token-free approaches.

Abstract

We introduce a new tokenizer for language models that minimizes the average tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we refer to as the Length-MAX tokenizer, obtains its vocabulary by casting a length-weighted objective maximization as a graph partitioning problem and developing a greedy approximation algorithm. On FineWeb and diverse domains, it yields 14–18% fewer tokens than Byte Pair Encoding (BPE) across vocabulary sizes from 10K to 50K, and the reduction is 13.0% when the size is 64K. Training GPT-2 models at 124M, 355M, and 1.3B parameters from scratch with five runs each shows 18.5%, 17.2%, and 18.5% fewer steps, respectively, to reach a fixed validation loss, and 13.7%, 12.7%, and 13.7% lower inference latency, together with a 16% throughput gain at 124M, while consistently improving on downstream tasks, including reducing LAMBADA perplexity by 11.7% and enhancing HellaSwag accuracy by 4.3%. Moreover, the Length-MAX tokenizer achieves 99.62% vocabulary coverage, and the out-of-vocabulary rate remains low at 0.12% on test sets. These results demonstrate that optimizing for average token length, rather than frequency alone, offers an effective approach to more efficient language modeling without sacrificing (and often while improving) downstream performance. The tokenizer is compatible with production systems and reduces embedding and KV-cache memory by 18% at inference.
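To make the length-weighted objective concrete, the following is a minimal sketch of scoring candidate tokens by $\mathrm{freq}(t) \cdot |t|$. It is not the authors' algorithm: it enumerates substrings by brute force rather than with Rabin-Karp hashing, counts overlapping occurrences, and performs no graph partitioning or scoreboard updates. The function names, the `max_len` cap, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def length_weighted_scores(corpus, max_len=8):
    """Score each substring t of 2..max_len characters by freq(t) * len(t).

    Illustrative only: brute-force enumeration with overlapping counts,
    not the paper's scoreboard-based greedy procedure.
    """
    counts = Counter()
    for text in corpus:
        for start in range(len(text)):
            for length in range(2, max_len + 1):
                if start + length > len(text):
                    break
                counts[text[start:start + length]] += 1
    return {t: freq * len(t) for t, freq in counts.items()}


def top_candidates(corpus, k=3, max_len=8):
    """Return the k highest-scoring candidates.

    Ties are broken by preferring longer strings, then lexicographic
    order (an assumed rule, chosen only to make the output deterministic).
    """
    scores = length_weighted_scores(corpus, max_len)
    return sorted(scores, key=lambda t: (-scores[t], -len(t), t))[:k]


if __name__ == "__main__":
    toy_corpus = ["the united states", "the united nations"]
    for token in top_candidates(toy_corpus, k=5):
        print(repr(token))
```

Under a pure frequency criterion, short fragments tend to dominate the ranking; multiplying by length lets a longer string that recurs less often compete with them, which is the intuition behind grouping multi-word units into single tokens.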

Paper Structure

This paper contains 59 sections, 4 equations, 11 figures, 20 tables.

Figures (11)

  • Figure 1: Scaling Trends of Long-Range Understanding for Length-MAX vs. BPE. The figure compares Length-MAX (blue line) against a standard BPE baseline (red line) across three model sizes (124M, 355M, and 1.3B) and their corresponding optimal training data sizes. The four subplots show: (a) Training Efficiency, measured in thousands of training steps to reach a target loss (lower is better); (b) Tokenization Efficiency for a 50k vocabulary, measured in TPC (lower is better); (c) Long-Range Reasoning, measured by MNLI accuracy (higher is better); and (d) Long-Range Dependency, measured by LAMBADA perplexity (lower is better). Gray boxes indicate the relative improvement of Length-MAX over the BPE baseline at each model size, showing that advantages are substantial and persist across scales.
  • Figure 2: Integrated view of the LLM training pipeline (top) and the internal Length-MAX tokenizer workflow (bottom). The tokenizer converts raw shards into a tokenised corpus via a scoreboard-based greedy loop (e.g., grouping "the_United_States"), after which standard Transformer training proceeds.
  • Figure 3: Toy graph before (left) and after (right) partition. Left panel shows all pairwise edges (grey dashed) with weights; right panel retains only intra-cluster edges after applying our graph partition.
  • Figure 4: TPC across tokenizers and vocabulary sizes on the FineWeb10B training corpus. Length-MAX consistently achieves better compression than frequency-based baselines.
  • Figure 5: Scaling trends for training steps (left) and inference latency (right) across model sizes. Solid points show measured results at 124M, 355M, and 1.3B parameters (mean ± std over five runs); the dashed line shows a FLOPs-based analytical prediction at 7B. Length-MAX maintains consistent relative gains across scales.
  • ...and 6 more figures
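
Several of the captions above report TPC (tokens per character, lower is better) and the relative improvement over a BPE baseline. A minimal sketch of how those two quantities could be computed is shown here; the function names are hypothetical, and the numbers in the usage example are invented for illustration and are not taken from the paper.

```python
def tokens_per_character(token_counts, char_counts):
    """Corpus-level TPC: total tokens divided by total characters."""
    total_chars = sum(char_counts)
    if total_chars == 0:
        raise ValueError("corpus contains no characters")
    return sum(token_counts) / total_chars


def relative_reduction(baseline, candidate):
    """Fractional reduction of `candidate` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - candidate) / baseline


if __name__ == "__main__":
    # Hypothetical per-document counts, not results from the paper.
    chars = [100, 100]
    baseline_tpc = tokens_per_character([25, 25], chars)
    candidate_tpc = tokens_per_character([20, 20], chars)
    print(f"relative reduction: {relative_reduction(baseline_tpc, candidate_tpc):.1%}")
```

Summing tokens and characters over the whole corpus before dividing weights long documents more heavily than averaging per-document ratios would; which convention the paper uses is not stated in this excerpt.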