Lossless data compression by large models

Ziguang Li, Chao Huang, Xuliang Wang, Haibo Hu, Cole Wyeth, Dongbo Bu, Quan Yu, Wen Gao, Xingwu Liu, Ming Li

TL;DR

This paper proposes LMCompress, a lossless data compression framework that harnesses autoregressive large models to approximate Solomonoff induction by tokenizing data, predicting token distributions, and applying arithmetic coding. It demonstrates substantial, cross-domain gains in compression ratios for images, videos (lossless and lossy), audio, and domain-specific text, outperforming traditional codecs and baseline LLM-based methods. The key insight is that richer data understanding from large models translates directly into better compression efficiency, potentially enabling high-bandwidth transmission in future networks. The work suggests a new Kolmogorov-inspired paradigm for compression with practical implications for 6G communications, while outlining future directions such as inter-frame lossless video coding and retrieval-augmented generation integration.

Abstract

Modern data compression methods are slowly reaching their limits after 80 years of research, millions of papers, and a wide range of applications. Yet the extravagant 6G communication speed requirement raises a major open question: what revolutionary new ideas can advance data compression? We have previously shown that all understanding or learning is compression, under reasonable assumptions. Large language models (LLMs) understand data better than ever before. Can they help us compress data? LLMs may be seen as approximating the uncomputable Solomonoff induction. Therefore, under this new uncomputable paradigm, we present LMCompress. LMCompress shatters all previous lossless compression algorithms, doubling the lossless compression ratios of JPEG-XL for images, FLAC for audio, and H.264 for video, and quadrupling the compression ratio of bz2 for text. The better a large model understands the data, the better LMCompress compresses.
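
The pipeline summarized in the TL;DR (tokenize, predict a distribution for each token, arithmetic-code the tokens under those distributions) can be illustrated with a toy sketch. This is not the authors' implementation. The class `AdaptiveByteModel` is a hypothetical stand-in for the generative large model, tokens are assumed to be raw bytes, and exact rational arithmetic replaces the finite-precision coder a practical system would use. The message length is assumed to be sent alongside the code. All function names are illustrative.

```python
import math
from fractions import Fraction

ALPHABET = 256  # assumption: tokens are raw bytes


class AdaptiveByteModel:
    """Hypothetical stand-in for the large model: Laplace-smoothed byte counts."""

    def __init__(self):
        self.counts = [1] * ALPHABET
        self.total = ALPHABET

    def interval(self, symbol):
        """Exact cumulative-probability interval [lo, hi) of `symbol`."""
        below = sum(self.counts[:symbol])
        return (Fraction(below, self.total),
                Fraction(below + self.counts[symbol], self.total))

    def find(self, point):
        """Symbol whose interval contains `point`, for point in [0, 1)."""
        target = point * self.total
        cum = 0
        for symbol, count in enumerate(self.counts):
            if cum + count > target:
                return symbol
            cum += count
        raise ValueError("point must lie in [0, 1)")

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1


def encode(data):
    """Return (m, k): the k-bit binary fraction m / 2**k that encodes `data`."""
    model = AdaptiveByteModel()
    low, width = Fraction(0), Fraction(1)
    for symbol in data:
        lo, hi = model.interval(symbol)
        # Narrow the current interval to the sub-interval of this symbol.
        low, width = low + width * lo, width * (hi - lo)
        model.update(symbol)
    k = 0
    while True:  # shortest binary fraction inside [low, low + width)
        m = math.ceil(low * 2**k)
        if Fraction(m, 2**k) < low + width:
            return m, k
        k += 1


def decode(m, k, n):
    """Recover n bytes by replaying the same model on the decoder side."""
    model = AdaptiveByteModel()
    value = Fraction(m, 2**k)
    low, width = Fraction(0), Fraction(1)
    out = bytearray()
    for _ in range(n):
        symbol = model.find((value - low) / width)
        lo, hi = model.interval(symbol)
        low, width = low + width * lo, width * (hi - lo)
        model.update(symbol)
        out.append(symbol)
    return bytes(out)


def ideal_bits(data):
    """Sum of -log2 p(token | previous tokens) under the same model."""
    model = AdaptiveByteModel()
    bits = 0.0
    for symbol in data:
        lo, hi = model.interval(symbol)
        bits -= math.log2(hi - lo)
        model.update(symbol)
    return bits
```

Calling `m, k = encode(data)` and then `decode(m, k, len(data))` returns the original bytes. The search loop in `encode` stops no later than the first `k` with `2**-k <= width`, so `k` is at most `ideal_bits(data)` rounded up. Under this scheme a model that assigns higher probability to the actual tokens yields a shorter code, which is the sense in which better understanding gives better compression.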

Paper Structure (16 sections, 4 figures, 3 tables)

Figures (4)

  • Figure 1: The architecture of our LMCompress. First, the original data is transformed into a sequence of tokens. Then, this token sequence is fed into a generative large model, which outputs the predictive distribution for each token. Finally, arithmetic coding losslessly compresses the original data based on the predictive distributions. The tokenization module and the generative large model may vary according to the type of the data.
  • Figure 2: Image compression ratios
  • Figure 3: Lossless video compression ratios of the state-of-the-art approaches and our LMCompress. Dataset: Xiph.org videos classified into "static scene" and "dynamic scene"
  • Figure 4: Text compression ratios. Dataset: MeDAL and Pile of Law. "LLaMA3-8B" denotes the text compressor of huang2023approximating with LLaMA2-7B replaced by LLaMA3-8B.