The KoLMogorov Test: Compression by Code Generation

Ori Yoran, Kunhao Zheng, Fabian Gloeckle, Jonas Gehring, Gabriel Synnaeve, Taco Cohen

TL;DR

The KoLMogorov-Test (KT) reframes compression as a reasoning task for code-generating LLMs by asking models to produce the shortest program that generates a given sequence, aligning with Kolmogorov complexity $K(x)$. It combines naturally occurring data across audio, text, and DNA with a synthetic, DSL-based data generator to create an essentially infinite set of problems whose compression length cannot be gamed by memorization. Zero-shot prompting of state-of-the-art models yields high error rates, while training dedicated SeqCoder models on synthetic data yields substantial gains on that distribution, though generalization to real data remains limited. The work highlights the need for new innovations in reasoning, search, and priors to advance KT, and it provides reproducibility assets and a roadmap for scaling the benchmark.

Abstract

Compression is at the heart of intelligence. A theoretically optimal way to compress any sequence of data is to find the shortest program that outputs that sequence and then halts. However, such 'Kolmogorov compression' is uncomputable, and code generating LLMs struggle to approximate this theoretical ideal, as it requires reasoning, planning and search capabilities beyond those of current models. In this work, we introduce the KoLMogorov-Test (KT), a compression-as-intelligence test for code generating LLMs. In KT a model is presented with a sequence of data at inference time, and asked to generate the shortest program that produces the sequence. We identify several benefits of KT for both evaluation and training: an essentially infinite number of problem instances of varying difficulty is readily available, strong baselines already exist, the evaluation metric (compression) cannot be gamed, and pretraining data contamination is highly unlikely. To evaluate current models, we use audio, text, and DNA data, as well as sequences produced by random synthetic programs. Current flagship models perform poorly - both GPT4-o and Llama-3.1-405B struggle on our natural and synthetic sequences. On our synthetic distribution, we are able to train code generation models with lower compression rates than previous approaches. Moreover, we show that gains on synthetic data generalize poorly to real data, suggesting that new innovations are necessary for additional gains on KT.
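
The task described in the abstract can be made concrete with a small sketch: a candidate program counts only if running it reproduces the target sequence exactly, and its score is its size relative to the data. This is an illustration under stated assumptions, not the paper's evaluation code. The convention that a program stores its result in a variable named `output`, the use of raw UTF-8 byte length as the program size, and the fallback rate of 1.0 for incorrect programs are all assumptions made here; the names `run_program` and `compression_rate` are hypothetical.

```python
def run_program(source: str) -> bytes:
    """Execute `source` and return the bytes it stores in `output` (assumed convention)."""
    namespace: dict = {}
    # exec is unsafe for untrusted code; a real harness would sandbox this step.
    exec(source, namespace)
    return bytes(namespace["output"])


def compression_rate(source: str, target: bytes) -> float:
    """Program size divided by data size; lower is better.

    Assumption: a program that crashes or prints the wrong sequence
    is scored as 1.0, i.e. no compression at all.
    """
    try:
        correct = run_program(source) == target
    except Exception:
        correct = False
    if not correct:
        return 1.0
    return len(source.encode("utf-8")) / len(target)


# Hypothetical target: an ascending run followed by a constant run.
target = bytes(list(range(100)) + [7] * 100)
good_program = "output = list(range(100)) + [7] * 100"
wrong_program = "output = list(range(100))"

print(compression_rate(good_program, target))
print(compression_rate(wrong_program, target))
```

Because correctness is checked by execution, a short program that does not reproduce the data earns nothing, which is the sense in which the abstract says the metric cannot be gamed.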

Paper Structure

This paper contains 64 sections, 2 equations, 19 figures, 10 tables, 2 algorithms.

Figures (19)

  • Figure 1: Data compression by code generation. Consider compressing a sequence of bytes (presented as numbers in the range $[0, 255]$) that can be produced by composing simpler sub-sequences. Standard compression methods, such as Gzip, focus on repetitions and character frequencies and fail to exploit the logical patterns in this sequence (although they are strong baselines for long sequences; see the paper's analysis section). LLMs are better at finding complex patterns, such as a sequence of incremental numbers, and can be used for compression with arithmetic coding. However, they are sensitive to phase shifts because of their autoregressive nature, and they require the model weights for decoding. Code-generating models, inspired by the concept of Kolmogorov complexity, can identify patterns in the input sequence and generate concise programs whose execution produces the original sequence.
  • Figure 2: Our main experimental settings.
  • Figure 3: Two examples of program-sequence pairs from our synthetic data generation process.
  • Figure 4: CompressionRate for SeqCoder models trained on 10K-1M program-sequence pairs. Models trained on enough data outperform the baselines.
  • Figure 5: Accuracy for our SeqCoder-1.5B on Audio MFCC sequences of lengths 16-128. Models trained on more examples are significantly better on short sequences, but all models struggle on longer ones.
  • ...and 14 more figures
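
The contrast drawn in the Figure 1 caption can be tried on a toy input. The sketch below builds a byte sequence by composing three simple sub-sequences, then compares the size of its Gzip encoding with the size of a short Python program that regenerates it. The sequence and the program are hypothetical examples chosen for illustration, not the ones shown in the figure, and measuring the program by its UTF-8 byte length is an assumption.

```python
import gzip

# Hypothetical sequence: an ascending run, a constant run, a descending run.
sequence = bytes(list(range(50)) + [200] * 50 + list(range(49, -1, -1)))

# A short program that regenerates the sequence by describing its structure.
program = "output = list(range(50)) + [200] * 50 + list(range(49, -1, -1))"

# Decompression is just execution: run the program and read back `output`.
namespace = {}
exec(program, namespace)
decoded = bytes(namespace["output"])

gzip_size = len(gzip.compress(sequence))
program_size = len(program.encode("utf-8"))

print("raw bytes:    ", len(sequence))
print("gzip bytes:   ", gzip_size)
print("program bytes:", program_size)
```

Gzip can shorten the constant run, but the two arithmetic runs contain no repeated substrings for it to reference, whereas the program expresses each run in a few characters. On inputs this short, the fixed header and trailer of the Gzip format also add noticeable overhead.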