LLMZip: Lossless Text Compression using Large Language Models

Chandra Shekhara Kaushik Valmeekam; Krishna Narayanan; Dileep Kalathil; Jean-Francois Chamberland; Srinivas Shakkottai

TL;DR

This work uses LLaMA-7B as a next-token predictor to derive a tight asymptotic upper bound on the entropy of English and to build a lossless text compressor. The framework derives $H(\boldsymbol{S}) \le \frac{ \lim_{N_T \to \infty} -\frac{1}{N_T}\sum_{i=1}^{N_T} \log_2 q_i(X_i) }{ \mathbb{E}[B] }$ and demonstrates three encoding schemes, including arithmetic coding with time-varying PMFs, to achieve near-optimal performance. Empirically, it reports $H_{\text{ub}}$ near $0.709$ bits/character for 1MB of text8 and $0.85$ bits/character for 100KB of Gutenberg text, with arithmetic coding achieving $0.7101$ and $0.8426$ bits/character respectively, outperforming the ZPAQ and paq8h baselines on these datasets. The results suggest that practical, high-efficiency lossless compression can be guided by large language models, and they provide sharper entropy estimates than prior classical bounds.
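The bound above can be approximated on a finite sample by averaging $-\log_2 q_i(X_i)$ over the observed tokens and dividing by the average number of characters per token. The following is a minimal sketch under that assumption; the function name `entropy_upper_bound` and its two input lists are hypothetical and are not taken from the paper.

```python
import math


def entropy_upper_bound(token_probs, token_char_lengths):
    """Finite-sample estimate of the bound, in bits per character.

    token_probs[i]        -- q_i(X_i), the probability the model assigned
                             to the token that actually occurred (assumed input)
    token_char_lengths[i] -- number of characters in that token (assumed input)
    """
    if not token_probs or len(token_probs) != len(token_char_lengths):
        raise ValueError("inputs must be non-empty and of equal length")
    # Numerator: average code length per token, -1/N_T * sum(log2 q_i(X_i)).
    bits_per_token = -sum(math.log2(q) for q in token_probs) / len(token_probs)
    # Denominator: empirical stand-in for E[B], the characters per token.
    chars_per_token = sum(token_char_lengths) / len(token_char_lengths)
    return bits_per_token / chars_per_token


# Two tokens of four characters each, predicted with probability 1/2 and 1/4.
estimate = entropy_upper_bound([0.5, 0.25], [4, 4])
```

With only finitely many tokens this gives an empirical estimate of the limit, not the limit itself.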

Abstract

We provide new estimates of an asymptotic upper bound on the entropy of English using the large language model LLaMA-7B as a predictor for the next token given a window of past tokens. This estimate is significantly smaller than currently available estimates in \cite{cover1978convergent}, \cite{lutati2023focus}. A natural byproduct is an algorithm for lossless compression of English text which combines the prediction from the large language model with a lossless compression scheme. Preliminary results from limited experiments suggest that our scheme outperforms state-of-the-art text compression schemes such as BSC, ZPAQ, and paq8h.
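One way to combine a predictor with a standard lossless compressor is to replace each symbol by its rank in the predictor's sorted list of candidates and then compress the rank sequence with zlib. The sketch below illustrates that idea only. It swaps LLaMA-7B for a hypothetical adaptive byte-frequency model (`ToyPredictor`) and works on bytes instead of tokens, so it will not reproduce the paper's compression ratios.

```python
import zlib
from collections import defaultdict


class ToyPredictor:
    """Hypothetical stand-in for the LLM: adaptive order-1 byte counts."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0] * 256)

    def ranking(self, context):
        # Most frequent next byte first; ties broken by byte value so that
        # the encoder and decoder always agree on the ordering.
        c = self.counts[context]
        return sorted(range(256), key=lambda b: (-c[b], b))

    def update(self, context, symbol):
        self.counts[context][symbol] += 1


def compress(text: str) -> bytes:
    model, context, ranks = ToyPredictor(), 0, bytearray()
    for symbol in text.encode("utf-8"):
        # Emit the rank of the true symbol under the current prediction.
        ranks.append(model.ranking(context).index(symbol))
        model.update(context, symbol)
        context = symbol
    return zlib.compress(bytes(ranks), 9)


def decompress(blob: bytes) -> str:
    model, context, out = ToyPredictor(), 0, bytearray()
    for rank in zlib.decompress(blob):
        # Re-run the same predictor and pick the symbol at the stored rank.
        symbol = model.ranking(context)[rank]
        out.append(symbol)
        model.update(context, symbol)
        context = symbol
    return out.decode("utf-8")


sample = "the quick brown fox jumps over the lazy dog " * 20
restored = decompress(compress(sample))
```

The scheme is lossless because the decoder rebuilds the predictor state from the symbols it has already recovered. A better predictor puts more of the ranks at zero, which makes the rank sequence easier for zlib to compress.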

Paper Structure (10 sections, 17 equations, 4 figures, 4 tables)


Introduction
Intuitive explanation of the main idea
Compression using LLMs
Entropy bounds
Encoding schemes
Compressing the ranks using zlib
Token-by-Token Compression
Arithmetic Coding
Results
Acknowledgement

Figures (4)

Figure 1: Schematic showing the prediction at epoch 5 for a language model with memory 4.
Figure 2: Schematic showing the prediction at epoch 6 for a language model with memory 4.
Figure 3: Schematic showing the compression of the sequence of ranks to a bit sequence.
Figure 4: Schematic showing the prediction at epoch $i$.
