Frequency-Ordered Tokenization for Better Text Compression

Maximilian Kalcher

TL;DR

Frequency-ordered tokenization is a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law). The preprocessing also accelerates compression for computationally expensive algorithms.

Abstract

We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law). The method tokenizes text with Byte Pair Encoding (BPE), reorders the vocabulary so that frequent tokens receive small integer identifiers, and encodes the result with variable-length integers before passing it to any standard compressor. On enwik8 (100 MB Wikipedia), this yields improvements of 7.08 percentage points (pp) for zlib, 1.69 pp for LZMA, and 0.76 pp for zstd (all including vocabulary overhead), outperforming the classical Word Replacing Transform. Gains are consistent at 1 GB scale (enwik9) and across Chinese and Arabic text. We further show that preprocessing accelerates compression for computationally expensive algorithms: the total wall-clock time including preprocessing is 3.1x faster than raw zstd-22 and 2.4x faster than raw LZMA, because the preprocessed input is substantially smaller. The method can be implemented in under 50 lines of code.
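The three steps named in the abstract (tokenize, reorder identifiers by frequency, encode as variable-length integers) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: a regex word splitter stands in for BPE, the varint layout is assumed to be the common 7-bits-per-byte (LEB128-style) scheme, and the function names `tokenize`, `preprocess`, and `restore` are hypothetical.

```python
import re
import zlib
from collections import Counter


def encode_varint(n: int) -> bytes:
    """Unsigned varint: 7 payload bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte
            return bytes(out)


def decode_varints(data: bytes) -> list:
    """Inverse of encode_varint applied to a concatenated stream."""
    ids, cur, shift = [], 0, 0
    for b in data:
        cur |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            ids.append(cur)
            cur, shift = 0, 0
    return ids


def tokenize(text: str) -> list:
    # ASSUMPTION: stand-in for BPE. Splits into word runs, whitespace runs,
    # and single other characters, so "".join(tokens) == text (lossless).
    return re.findall(r"\w+|\s+|[^\w\s]", text)


def preprocess(text: str):
    tokens = tokenize(text)
    # Vocabulary sorted by descending frequency: list index = new identifier.
    vocab = [tok for tok, _ in Counter(tokens).most_common()]
    rank = {tok: i for i, tok in enumerate(vocab)}
    stream = b"".join(encode_varint(rank[t]) for t in tokens)
    return vocab, stream


def restore(vocab, stream: bytes) -> str:
    return "".join(vocab[i] for i in decode_varints(stream))


if __name__ == "__main__":
    text = "the cat sat on the mat. " * 200
    vocab, stream = preprocess(text)
    packed = zlib.compress(stream, 9)  # any standard compressor could follow
    assert restore(vocab, zlib.decompress(packed)) == text
    print(len(text.encode()), len(stream), len(packed))
```

The sketch does not serialize the vocabulary, whereas the figures quoted in the abstract include vocabulary overhead, so the printed sizes are not comparable to the reported results.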

Paper Structure (22 sections, 3 equations, 3 figures, 8 tables, 1 algorithm)

Figures (3)

  • Figure 1: Zipf's law on enwik8: word frequency vs. rank (top 100). The top 10 words account for over 20% of all occurrences.
  • Figure 2: (a) Log-log plot of BPE token rank vs. frequency on enwik8, with Zipf fit ($\alpha = 1.04$). (b) Distribution of varint byte lengths before and after frequency reordering.
  • Figure 3: Compression ratio vs. file size. Solid: with preprocessing; dashed: raw.
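
The effect described in the Figure 2 caption, that frequency reordering shifts tokens toward shorter varint encodings, can be illustrated with a small calculation. The sketch below is hedged in several ways: it assumes a 7-bits-per-byte varint (1 byte for identifiers below 128, 2 bytes below 16,384), an idealized pure Zipf distribution with the exponent 1.04 from the caption, and a hypothetical vocabulary size of 50,000 that is not taken from the paper. A random identifier assignment stands in for the "before" case.

```python
import random


def varint_len(n: int) -> int:
    """Bytes used by a 7-bits-per-byte unsigned varint for n >= 0."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def expected_bytes_per_token(ids_by_rank, alpha: float) -> float:
    """Mean varint length when the token of frequency rank r (0-based)
    has probability proportional to 1 / (r + 1) ** alpha and is assigned
    the identifier ids_by_rank[r]."""
    weights = [1.0 / (r + 1) ** alpha for r in range(len(ids_by_rank))]
    total = sum(weights)
    return sum(w * varint_len(i) for w, i in zip(weights, ids_by_rank)) / total


if __name__ == "__main__":
    vocab_size = 50_000  # ASSUMPTION: hypothetical, not from the paper
    alpha = 1.04         # Zipf exponent quoted in the Figure 2 caption

    ordered = list(range(vocab_size))   # most frequent token gets id 0
    shuffled = ordered[:]
    random.Random(0).shuffle(shuffled)  # arbitrary id assignment

    print("frequency-ordered:", expected_bytes_per_token(ordered, alpha))
    print("arbitrary order:  ", expected_bytes_per_token(shuffled, alpha))
```

Under these assumptions the frequency-ordered assignment gives the lower mean, because the probability mass concentrated in the top ranks lands on one-byte identifiers.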