Frequency-Ordered Tokenization for Better Text Compression

Maximilian Kalcher

TL;DR

We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law). We also show that the preprocessing reduces total compression time for computationally expensive algorithms.

Abstract

We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law). The method tokenizes text with Byte Pair Encoding (BPE), reorders the vocabulary so that frequent tokens receive small integer identifiers, and encodes the result with variable-length integers before passing it to any standard compressor. On enwik8 (100 MB Wikipedia), this yields improvements of 7.08 percentage points (pp) for zlib, 1.69 pp for LZMA, and 0.76 pp for zstd (all including vocabulary overhead), outperforming the classical Word Replacing Transform. Gains are consistent at 1 GB scale (enwik9) and across Chinese and Arabic text. We further show that preprocessing accelerates compression for computationally expensive algorithms: the total wall-clock time including preprocessing is 3.1x faster than raw zstd-22 and 2.4x faster than raw LZMA, because the preprocessed input is substantially smaller. The method can be implemented in under 50 lines of code.
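The pipeline described in the abstract (reorder token identifiers by frequency, encode them as variable-length integers, then hand the bytes to a standard compressor) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the input is assumed to already be a list of integer token IDs (the paper obtains these from BPE, which is omitted here), the varint format is assumed to be LEB128-style with 7 payload bits per byte, and the function names `encode_varint`, `decode_varints`, `frequency_order`, `pack`, and `unpack` are hypothetical. zlib is used only as an example back end.

```python
import zlib
from collections import Counter


def encode_varint(n):
    """Encode a non-negative int as an LEB128-style varint (assumed format)."""
    out = bytearray()
    while True:
        byte = n & 0x7F          # low 7 bits carry the payload
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varints(data):
    """Decode a concatenation of varints back into a list of ints."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def frequency_order(token_ids):
    """Remap IDs so the most frequent token gets rank 0, the next rank 1, etc."""
    counts = Counter(token_ids)
    # Ties are broken by original ID so the mapping is deterministic.
    rank_to_original = sorted(counts, key=lambda t: (-counts[t], t))
    original_to_rank = {t: r for r, t in enumerate(rank_to_original)}
    return [original_to_rank[t] for t in token_ids], rank_to_original


def pack(token_ids):
    """Frequency-reorder, varint-encode, and compress with zlib."""
    ranks, rank_to_original = frequency_order(token_ids)
    payload = b"".join(encode_varint(r) for r in ranks)
    return zlib.compress(payload, 9), rank_to_original


def unpack(blob, rank_to_original):
    """Invert pack(): decompress, decode varints, map ranks back to IDs."""
    ranks = decode_varints(zlib.decompress(blob))
    return [rank_to_original[r] for r in ranks]


# Hypothetical token IDs standing in for BPE output.
token_ids = [50000, 7, 50000, 50000, 31999, 7, 50000]
blob, table = pack(token_ids)
restored = unpack(blob, table)
```

In this sketch the rank table returned by `pack` has to be stored alongside the compressed payload for decoding, which corresponds to the vocabulary overhead the abstract says is included in the reported figures. How the paper serializes that table is not shown in this section, so the table is simply kept as a Python list here.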

Paper Structure

This paper contains 22 sections, 3 equations, 3 figures, 8 tables, and 1 algorithm.

Introduction
Background and Related Work
Zipf's Law and Variable-Length Encoding
Byte Pair Encoding
Compression Algorithms
Related Work
Method
Frequency-Ordered Tokenization
Information-Theoretic Analysis
Experiments
Setup
Main Results
Ablation: Tokenization vs. Reordering
Zipf's Law Verification
File Size Scaling
...and 7 more sections

Figures (3)

Figure 1: Zipf's law on enwik8: word frequency vs. rank (top 100). The top 10 words account for over 20% of all occurrences.
Figure 2: (a) Log-log plot of BPE token rank vs. frequency on enwik8, with Zipf fit ($\alpha = 1.04$). (b) Distribution of varint byte lengths before and after frequency reordering.
Figure 3: Compression ratio vs. file size. Solid: with preprocessing; dashed: raw.
