LLMZip: Lossless Text Compression using Large Language Models
Chandra Shekhara Kaushik Valmeekam, Krishna Narayanan, Dileep Kalathil, Jean-Francois Chamberland, Srinivas Shakkottai
TL;DR
This work uses LLaMA-7B as a next-token predictor to derive a tight asymptotic upper bound on the entropy of English and to build a lossless text compressor. The framework derives $H(\boldsymbol{S}) \le \frac{ \lim_{N_T \to \infty} -\frac{1}{N_T}\sum_{i=1}^{N_T} \log_2 q_i(X_i) }{ \mathbb{E}[B] }$ and demonstrates three encoding schemes, including arithmetic coding with time-varying PMFs, to achieve near-optimal performance. Empirically, it reports $H_{\text{ub}}$ near $0.709$ bits/character for 1MB of text8 and $0.85$ bits/character for 100KB Gutenberg Texas, with arithmetic coding achieving $0.7101$ and $0.8426$ bits/character respectively, outperforming the ZPAQ and paq8h baselines on these datasets. The results suggest that practical, high-efficiency lossless compression can be guided by large language models, and they provide sharper entropy estimates than prior classical bounds.
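As a rough illustration of how the bound is evaluated on a finite sample, the sketch below divides the average code length per token, $-\frac{1}{N_T}\sum_i \log_2 q_i(X_i)$, by the empirical mean number of characters per token, which stands in for $\mathbb{E}[B]$. The function names and the toy probabilities are assumptions made for this example; in the paper the $q_i$ come from LLaMA-7B, which is not called here.

```python
import math

def bits_per_token(probs):
    """Average of -log2 q_i(x_i) over the observed tokens."""
    return -sum(math.log2(q) for q in probs) / len(probs)

def entropy_upper_bound(probs, token_char_lengths):
    """Finite-sample estimate of the bound, in bits per character.

    probs: probability the predictor assigned to each token that actually occurred.
    token_char_lengths: number of characters in each of those tokens
    (their mean is used as an empirical stand-in for E[B]).
    """
    if len(probs) != len(token_char_lengths) or not probs:
        raise ValueError("need one length per token and at least one token")
    mean_chars_per_token = sum(token_char_lengths) / len(token_char_lengths)
    return bits_per_token(probs) / mean_chars_per_token

# Toy inputs (hypothetical, not taken from the paper).
probs = [0.5, 0.25, 0.5, 0.125]   # costs 1, 2, 1, and 3 bits
lengths = [4, 3, 5, 4]            # 16 characters in total
estimate = entropy_upper_bound(probs, lengths)
print(f"{estimate:.4f} bits/character")
```

With these toy inputs the tokens cost 7 bits in total and cover 16 characters, so the estimate equals total bits divided by total characters.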
Abstract
We provide new estimates of an asymptotic upper bound on the entropy of English using the large language model LLaMA-7B as a predictor for the next token given a window of past tokens. This estimate is significantly smaller than currently available estimates in \cite{cover1978convergent}, \cite{lutati2023focus}. A natural byproduct is an algorithm for lossless compression of English text which combines the prediction from the large language model with a lossless compression scheme. Preliminary results from limited experiments suggest that our scheme outperforms state-of-the-art text compression schemes such as BSC, ZPAQ, and paq8h.
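One simple way to picture "prediction plus a lossless compression scheme" is to replace each symbol by its rank under the predictor and then compress the rank sequence with a standard compressor. The sketch below is a toy under stated assumptions: a character-level bigram ranker (`build_predictor`, hypothetical) stands in for LLaMA-7B, zlib is the lossless stage, and encoder and decoder share the same predictor. It is not presented as the authors' implementation.

```python
import zlib
from collections import Counter, defaultdict

def build_predictor(training_text):
    """Toy stand-in for the language model: rank next characters by bigram count."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(training_text, training_text[1:]):
        counts[prev][nxt] += 1
    alphabet = sorted(set(training_text))

    def ranked(prev):
        # Most frequent first; ties broken alphabetically so both sides agree.
        return sorted(alphabet, key=lambda c: (-counts[prev][c], c))

    return ranked

def compress(text, ranked, start=" "):
    """Map each character to its rank given the previous one, then zlib the ranks."""
    prev, ranks = start, []
    for ch in text:
        ranks.append(ranked(prev).index(ch))  # raises ValueError if ch is unknown
        prev = ch
    return zlib.compress(bytes(ranks), 9)

def decompress(blob, ranked, start=" "):
    """Invert compress(): recover each character from its rank and the context."""
    prev, out = start, []
    for r in zlib.decompress(blob):
        prev = ranked(prev)[r]
        out.append(prev)
    return "".join(out)

# Hypothetical toy data; the alphabet must have at most 256 symbols.
corpus = "the cat sat on the mat and the rat sat on the hat "
ranked = build_predictor(corpus)
message = "the rat sat on the mat"
blob = compress(message, ranked)
restored = decompress(blob, ranked)
print(restored == message)
```

A better predictor pushes most ranks toward zero, which is what makes the rank sequence easy for the downstream compressor to shrink; on an input this short, zlib's fixed overhead dominates, so the example only demonstrates the lossless round trip.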
