LLMZip: Lossless Text Compression using Large Language Models

Chandra Shekhara Kaushik Valmeekam, Krishna Narayanan, Dileep Kalathil, Jean-Francois Chamberland, Srinivas Shakkottai

TL;DR

This work uses LLaMA-7B as a next-token predictor to derive a tight asymptotic upper bound on the entropy of English and to build a lossless text compressor. The framework derives the bound $H(\boldsymbol{S}) \le \frac{ \lim_{N_T \to \infty} -\frac{1}{N_T}\sum_{i=1}^{N_T} \log_2 q_i(X_i) }{ \mathbb{E}[B] }$, where $q_i(X_i)$ is the probability the model assigns to the $i$-th token and $\mathbb{E}[B]$ is the average number of characters per token, and demonstrates three encoding schemes, including arithmetic coding with time-varying PMFs, that approach this bound. Empirically, it reports $H_{\text{ub}}$ near $0.709$ bits/character for 1MB of text8 and $0.85$ bits/character for 100KB Gutenberg Texas, with arithmetic coding achieving $0.7101$ and $0.8426$ bits/character respectively, outperforming the ZPAQ and paq8h baselines on these datasets. The results suggest that practical, high-efficiency lossless compression guided by large language models is feasible, and they provide sharper entropy estimates than prior classical bounds.

Abstract

We provide new estimates of an asymptotic upper bound on the entropy of English using the large language model LLaMA-7B as a predictor for the next token given a window of past tokens. This estimate is significantly smaller than currently available estimates in \cite{cover1978convergent}, \cite{lutati2023focus}. A natural byproduct is an algorithm for lossless compression of English text which combines the prediction from the large language model with a lossless compression scheme. Preliminary results from limited experiments suggest that our scheme outperforms state-of-the-art text compression schemes such as BSC, ZPAQ, and paq8h.
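
As a rough illustration of how the upper bound in the TL;DR can be estimated from a finite sample, the sketch below divides the average per-token code length by the average number of characters per token. The function name `bits_per_character_bound` and the toy inputs are assumptions made for this example; they are not taken from the paper, and a real estimate would use probabilities produced by the language model on actual text.

```python
import math


def bits_per_character_bound(token_probs, token_char_lengths):
    """Finite-sample estimate of the bits/character upper bound.

    token_probs[i]        -- probability the predictor assigned to the i-th
                             token that actually occurred (q_i(X_i))
    token_char_lengths[i] -- number of characters in that token (B_i)
    Both inputs are hypothetical placeholders for model output.
    """
    if not token_probs or len(token_probs) != len(token_char_lengths):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(token_probs)
    # Average ideal code length per token, in bits.
    bits_per_token = -sum(math.log2(q) for q in token_probs) / n
    # Empirical stand-in for E[B], the mean characters per token.
    chars_per_token = sum(token_char_lengths) / n
    return bits_per_token / chars_per_token


# Toy example: four tokens with made-up probabilities and lengths.
probs = [0.5, 0.25, 0.125, 0.5]
lengths = [4, 3, 5, 4]
bound = bits_per_character_bound(probs, lengths)
print(f"{bound:.4f} bits/character")
```

The numerator is also the ideal code length that an arithmetic coder driven by the same probabilities would approach, which is why the measured compression ratios quoted above sit close to the estimated bound.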

Paper Structure

This paper contains 10 sections, 17 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Schematic showing the prediction at epoch 5 for a language model with memory 4.
  • Figure 2: Schematic showing the prediction at epoch 6 for a language model with memory 4.
  • Figure 3: Schematic showing the compression of the sequence of ranks to a bit sequence.
  • Figure 4: Schematic showing the prediction at epoch $i$.
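
The captions for Figures 1 to 3 refer to a predictor with memory 4 and to a sequence of ranks. A minimal sketch of that idea, under heavy assumptions, follows: each token is replaced by its rank in the predictor's ordering of candidates, and the decoder inverts the mapping by running the same predictor. The predictor here (`toy_ranking`, a frequency count over the last four tokens) and the four-symbol vocabulary are hypothetical stand-ins for a language model, ranks are 0-indexed, and the first few tokens simply use a shorter context; none of these details are taken from the paper.

```python
from collections import Counter

VOCAB = ["a", "b", "c", "d"]  # hypothetical toy vocabulary
MEMORY = 4                    # context window length, as in the captions


def toy_ranking(context):
    """Stand-in predictor: order VOCAB by frequency in the context,
    most frequent first, breaking ties by vocabulary order."""
    counts = Counter(context)
    return sorted(VOCAB, key=lambda tok: (-counts[tok], VOCAB.index(tok)))


def encode_ranks(tokens):
    """Replace each token by its rank under the predictor's ordering."""
    ranks = []
    for i, tok in enumerate(tokens):
        context = tokens[max(0, i - MEMORY):i]
        ranks.append(toy_ranking(context).index(tok))
    return ranks


def decode_ranks(ranks):
    """Invert encode_ranks by replaying the same predictor."""
    tokens = []
    for r in ranks:
        context = tokens[-MEMORY:]
        tokens.append(toy_ranking(context)[r])
    return tokens


message = list("aabaacaab")
ranks = encode_ranks(message)
print(ranks)
```

When the predictor is good, most ranks are small, so the rank sequence is highly skewed and a standard lossless compressor can shrink it to a short bit sequence.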