Compression via Pre-trained Transformers: A Study on Byte-Level Multimodal Data

David Heurtel-Depeiges; Anian Ruoss; Joel Veness; Tim Genewein

TL;DR

This study investigates whether small decoder-only transformers trained on raw byte streams can serve as competitive lossless data compressors once parameter size is counted in the compression ratio. The authors pre-train on 165GB of text, image, and audio data (including multimodal mixtures), evaluate on 1GB of out-of-distribution (OOD) data per modality, and compare against gzip, LZMA2, domain-specific codecs, and an online adaptive transformer. They show that million-parameter models can beat standard compressors on in-modality OOD data and that multimodal training improves multimodal compression, though transfer to unseen modalities remains weak, in contrast to very large foundation models. The work characterizes the inductive biases of small transformers for compression, showing how model size, context length, and data composition affect domain-general versus modality-specific performance, and it outlines practical limitations for large-scale deployment. The findings offer a nuanced view of compression as a prediction problem and establish a benchmark for evaluating small, multimodal neural compressors. By Shannon's source coding theorem, the expected code length of any lossless code is bounded below by the entropy of the source, $L_\rho \ge H(\rho)$, and the practical compression ratio is defined as $\frac{|\text{compressed data}| + |\text{compressor}|}{|\text{uncompressed data}|}$, which makes explicit the trade-off between predictive accuracy and model-parameter cost.
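The two quantities above can be illustrated with a minimal Python sketch. It assumes an ideal entropy coder, so the code length of a sequence is the sum of $-\log_2 p$ over the probabilities a model assigned to the bytes that actually occurred. The function names and all byte counts are hypothetical and chosen only for illustration; they are not results from the paper.

```python
import math


def code_length_bits(probs_of_observed_bytes):
    """Ideal code length in bits: sum of -log2 p(byte | history).

    Assumes a perfect entropy coder; a real arithmetic coder adds a
    small overhead on top of this value.
    """
    return sum(-math.log2(p) for p in probs_of_observed_bytes)


def adjusted_compression_ratio(compressed_bytes, compressor_bytes, raw_bytes):
    """(|compressed data| + |compressor|) / |uncompressed data|."""
    if raw_bytes <= 0:
        raise ValueError("raw_bytes must be positive")
    return (compressed_bytes + compressor_bytes) / raw_bytes


# Hypothetical sizes (NOT taken from the paper).
raw = 1_000_000_000          # 1GB of evaluation data
compressed = 400_000_000     # size of the coded output
small_model = 10_000_000     # parameter storage of a small model, in bytes
large_model = 2_000_000_000  # parameter storage of a large model, in bytes

print(adjusted_compression_ratio(compressed, small_model, raw))
print(adjusted_compression_ratio(compressed, large_model, raw))
```

With these made-up sizes, the same coded output gives a ratio below 1 for the small model and above 1 for the large one, which is the effect the adjusted ratio is meant to expose.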

Abstract

Foundation models are strong data compressors, but when accounting for their parameter size, their compression ratios are inferior to standard compression algorithms. Naively reducing the parameter count does not necessarily help as it deteriorates predictions and, accordingly, compression. We conduct a large-scale empirical study to find a sweet spot where pre-trained vanilla transformers can achieve competitive compression ratios. To this end, we train models on 165GB of raw byte sequences of either text, image, or audio data (and all possible combinations of the three) and then compress 1GB of out-of-distribution (OOD) data from each modality. We find that relatively small models (millions of parameters) can outperform standard general-purpose compression algorithms (gzip, LZMA2) and even domain-specific compressors (PNG, JPEG-XL, FLAC), even when accounting for parameter size. We achieve, e.g., the lowest compression ratio of 0.49 on OOD audio data (vs. 0.54 for FLAC). We conduct extensive ablations and hyperparameter sweeps to study the impact of model- and dataset scale, and we investigate the effect of unimodal versus multimodal training. We find that even small models can be trained to perform well on multiple modalities, but unlike large-scale foundation models, transfer to unseen modalities is generally weak.
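For readers who want a feel for the general-purpose baselines named above, the following is a minimal sketch using Python's standard `gzip` and `lzma` modules (the default XZ container uses the LZMA2 filter). It is not the paper's evaluation protocol: the helper name `baseline_ratios`, the compression settings, and the toy input are assumptions, and the compressor's own size is ignored here.

```python
import gzip
import lzma


def baseline_ratios(data: bytes) -> dict:
    """Raw compression ratio (compressed size / original size) per baseline."""
    if not data:
        raise ValueError("data must be non-empty")
    return {
        # Highest standard gzip level; an assumption, not the paper's setting.
        "gzip": len(gzip.compress(data, compresslevel=9)) / len(data),
        # XZ container with the LZMA2 filter at preset 9 (also an assumption).
        "lzma": len(lzma.compress(data, preset=9)) / len(data),
    }


# Toy input: highly repetitive text, so both ratios come out far below 1.
sample = b"compression is prediction. " * 4096
ratios = baseline_ratios(sample)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f}")
```

Real text, image, or audio bytes compress far less than this repetitive toy input, so the printed values say nothing about the ratios reported in the abstract.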

Paper Structure (45 sections, 3 equations, 8 figures, 4 tables)

Introduction
Main Contributions
Background
Related Work
Compression Without Transformers
Online Transformers
Pre-Trained Transformers
Methods
Baselines
Models
(No) Tokenization
Evaluation
Training Datasets
OOD Evaluation Datasets
Results
...and 30 more sections

Figures (8)

Figure 1: Our training and evaluation data pipelines. We consider three modalities: text, images, and audio. From these we create training mixtures of $165$GB that are either unimodal or multimodal. After pre-training transformers on each of these, we evaluate their compression ratio (factoring in the model size) on all three modalities. If the corresponding modality has not been seen during training, the evaluation is 'out-of-modality', otherwise it is 'in-modality'. Importantly, our evaluation is always performed on out-of-distribution data (different from any of the training data sources), even when it is in-modality.
Figure 2: Small pre-trained transformers are domain-general compressors (panels: evaluation data mixtures, bars: training data mixtures). Our method (bars) outperforms standard compression algorithms (horizontal lines) and is on par with the online adaptive transformers from Bellard (2021) (blue line), as long as the evaluation modality is in the training mixture. There is very little cross-modal transfer to unseen modalities (unlike the foundation models in Delétang et al., 2024). Unimodal models are good for their respective modality, but multimodal models perform almost as well across all their training modalities (despite seeing much less data per modality than the unimodal models), i.e., one can trade off a small amount of performance on each individual modality to obtain a strong domain-general compressor via multimodal training (gray bar).
Figure 3: What you see is what you get. Each panel visualizes the compression ratios for one of our modalities when training models on varying dataset mixtures and sizes. Although one can replace a large proportion of the unimodal training datasets with data from other modalities without incurring significant losses on the original modality (note the scale of the y-axis), transformers (at our tested model sizes) do not exhibit improved transfer from the out-of-modality data (i.e., the multimodal models are worse than the unimodal ones, even when trained on much more data from that particular modality). Nevertheless, multimodal training data significantly improves multimodal compression performance (as shown in Figure 2).
Figure 4: Scaling training dataset- and model size (for unimodal training and evaluation). Colors indicate the model size; lines correspond to dataset size. We train for $2$ epochs regardless of dataset size (i.e., smaller datasets require fewer FLOPS). Increasing the model- and dataset size boosts compression (at the cost of FLOPS). Our OOD evaluation makes models more prone to overfitting (e.g., our largest image models), making scaling more complex than traditional LLM scaling laws.
Figure 5: Context- vs. model size. Both context size (measured in bytes) and model size affect the training compute budget (in FLOPS), leading to a non-trivial trade-off. Our results show that this trade-off is highly modality-dependent (note the different y-axis scales, i.e., the effect varies significantly with modality). For text, shorter context sizes and larger models are beneficial (short-term dependencies are most important). For images, larger context is generally beneficial, given that a single image consists of $512 \cdot 512 \cdot 3 = 786432$ bytes, far exceeding our models' contexts, i.e., models with larger context can process larger fractions of an image at once. For audio, the relationship is complex with intermediate context length and larger models performing better (the reverse is true for short contexts).
...and 3 more figures
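The image arithmetic in the Figure 5 caption can be made concrete with a short Python sketch that counts how many context windows are needed to cover one image. The image size comes from the caption; the helper name `windows_per_item` and the listed context sizes are hypothetical examples, not necessarily the values swept in the paper.

```python
def windows_per_item(item_bytes: int, context_bytes: int) -> int:
    """Number of context windows needed to cover one item (ceiling division)."""
    if item_bytes <= 0 or context_bytes <= 0:
        raise ValueError("sizes must be positive")
    return -(-item_bytes // context_bytes)


# One 512 x 512 RGB image at one byte per channel, as in the caption.
image_bytes = 512 * 512 * 3

# Hypothetical context sizes in bytes, for illustration only.
for context in (1024, 4096, 32768):
    n = windows_per_item(image_bytes, context)
    share = context / image_bytes
    print(f"context={context:>6} bytes: {n} windows, {share:.2%} of the image each")
```

Even the largest of these example contexts covers only a few percent of a single image, consistent with the caption's point that longer contexts let a model see a larger fraction of an image at once.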
