Compression via Pre-trained Transformers: A Study on Byte-Level Multimodal Data
David Heurtel-Depeiges, Anian Ruoss, Joel Veness, Tim Genewein
TL;DR
This study investigates whether small decoder-only transformers trained on raw byte streams can serve as competitive lossless data compressors once parameter size is counted in the compression ratio. The authors pre-train on 165GB of text, image, and audio data (including multimodal mixtures), evaluate on 1GB of out-of-distribution (OOD) data per modality, and compare against gzip, LZMA2, domain-specific codecs, and an online adaptive transformer. They show that models with millions of parameters can beat standard compressors on OOD data from a modality seen in training, and that multimodal training yields models that compress several modalities well. Transfer to unseen modalities remains weak, which distinguishes these small models from very large foundation models. The work characterizes the inductive biases of small transformers for compression, showing how model size, context length, and data composition affect domain-general versus modality-specific performance, and it outlines practical limitations for large-scale deployment. The findings frame compression as a prediction problem and establish a benchmark for evaluating small, multimodal neural compressors. The expected code length $L_\rho$ under a source $\rho$ is bounded below by the source entropy, $L_\rho \ge H(\rho)$, and the practical compression ratio is defined as $\frac{|\text{compressed data}| + |\text{compressor}|}{|\text{uncompressed data}|}$, which captures the trade-off between predictive accuracy and model-parameter cost.
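To make the parameter-adjusted compression ratio concrete, the following minimal Python sketch evaluates the definition given above. The function names, the bytes-per-parameter value, and all numbers in the usage lines are hypothetical and chosen only for illustration; they are not results from the study.

```python
def parameter_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Size of the model in bytes (assumption: fixed-width parameter storage)."""
    return num_params * bytes_per_param


def adjusted_compression_ratio(
    compressed_bytes: int, compressor_bytes: int, raw_bytes: int
) -> float:
    """(|compressed data| + |compressor|) / |uncompressed data|; lower is better."""
    if raw_bytes <= 0:
        raise ValueError("raw_bytes must be positive")
    return (compressed_bytes + compressor_bytes) / raw_bytes


# Hypothetical example: 1 GB of raw data, 400 MB of compressed output,
# and a 10M-parameter model stored at 4 bytes per parameter.
model_size = parameter_bytes(10_000_000, bytes_per_param=4)
unadjusted = adjusted_compression_ratio(400_000_000, 0, 1_000_000_000)
adjusted = adjusted_compression_ratio(400_000_000, model_size, 1_000_000_000)
```

In this illustrative setting the model adds 40 MB to the numerator, so the ratio rises from 0.40 to 0.44. The same accounting shows why a much larger model can lose to a classical compressor on 1GB of data even if its predictions are better.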
Abstract
Foundation models are strong data compressors, but when accounting for their parameter size, their compression ratios are inferior to standard compression algorithms. Naively reducing the parameter count does not necessarily help as it deteriorates predictions and, accordingly, compression. We conduct a large-scale empirical study to find a sweet spot where pre-trained vanilla transformers can achieve competitive compression ratios. To this end, we train models on 165GB of raw byte sequences of either text, image, or audio data (and all possible combinations of the three) and then compress 1GB of out-of-distribution (OOD) data from each modality. We find that relatively small models (millions of parameters) can outperform standard general-purpose compression algorithms (gzip, LZMA2) and even domain-specific compressors (PNG, JPEG-XL, FLAC), even when accounting for parameter size. We achieve, e.g., the lowest compression ratio of 0.49 on OOD audio data (vs. 0.54 for FLAC). We conduct extensive ablations and hyperparameter sweeps to study the impact of model and dataset scale, and we investigate the effect of unimodal versus multimodal training. We find that even small models can be trained to perform well on multiple modalities, but unlike large-scale foundation models, transfer to unseen modalities is generally weak.
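The link between prediction and compression that the abstract relies on can be illustrated with a toy predictor. The sketch below is not the transformer setup used in the study: it assumes a simple adaptive order-0 byte model with Laplace smoothing (a stand-in chosen for brevity) and computes the ideal code length $-\sum_t \log_2 p(x_t \mid x_{<t})$ that an arithmetic coder driven by such a predictor would approach. All function names are hypothetical.

```python
import math
from collections import Counter


def empirical_entropy_bits(data: bytes) -> float:
    """Order-0 empirical entropy of the byte sequence, in bits per byte."""
    n = len(data)
    counts = Counter(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def adaptive_code_length_bits(data: bytes) -> float:
    """Ideal code length (in bits) under an adaptive order-0 byte predictor.

    Assumption: Laplace smoothing, i.e. every byte value starts with count 1.
    """
    counts = [1] * 256
    total = 256
    bits = 0.0
    for b in data:
        bits -= math.log2(counts[b] / total)  # cost of coding byte b
        counts[b] += 1                        # update the predictor online
        total += 1
    return bits


sample = b"abracadabra" * 100
code_bits = adaptive_code_length_bits(sample)
entropy_bits = len(sample) * empirical_entropy_bits(sample)
ratio = code_bits / (8 * len(sample))  # compressed size / raw size, no model cost
```

For this toy predictor the code length never falls below the sequence's order-0 empirical entropy, an empirical analogue of the bound $L_\rho \ge H(\rho)$. A stronger predictor, such as a transformer conditioning on a long context, can reach shorter code lengths because it exploits dependencies between bytes, at the price of a larger compressor term in the adjusted ratio.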
