The No Free Lunch Theorem, Kolmogorov Complexity, and the Role of Inductive Biases in Machine Learning

Micah Goldblum, Marc Finzi, Keefer Rowan, Andrew Gordon Wilson

TL;DR

The paper argues that No Free Lunch theorems do not constrain real-world learning because natural data are structured and compressible, a property reflected in Kolmogorov complexity. It shows neural networks, including randomly initialized ones, exhibit a bias toward low-complexity labelings and sequences, and that PAC-Bayes bounds can explain cross-domain generalization with a single, flexible learner. By formalizing a Kolmogorov-style NFL and demonstrating universal simplicity biases across domains (vision, language, and tabular data), the authors advocate automated model selection and a unified learning approach that remains effective across data regimes. This perspective supports the trend toward transformer-based architectures and soft inductive biases, reducing the need for extensively tailored models per task.

Abstract

No free lunch theorems for supervised learning state that no learner can solve all problems or that all learners achieve exactly the same accuracy on average over a uniform distribution on learning problems. Accordingly, these theorems are often referenced in support of the notion that individual problems require specially tailored inductive biases. While virtually all uniformly sampled datasets have high complexity, real-world problems disproportionately generate low-complexity data, and we argue that neural network models share this same preference, formalized using Kolmogorov complexity. Notably, we show that architectures designed for a particular domain, such as computer vision, can compress datasets on a variety of seemingly unrelated domains. Our experiments show that pre-trained and even randomly initialized language models prefer to generate low-complexity sequences. Whereas no free lunch theorems seemingly indicate that individual problems require specialized learners, we explain how tasks that often require human intervention such as picking an appropriately sized model when labeled data is scarce or plentiful can be automated into a single learning algorithm. These observations justify the trend in deep learning of unifying seemingly disparate problems with an increasingly small set of machine learning models.
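The contrast between uniformly sampled and structured labels can be illustrated with an off-the-shelf compressor. The sketch below is not the paper's experiment: it uses `zlib` only as a crude, computable upper bound on Kolmogorov complexity (which is itself uncomputable), and the helper names `direct_bits` and `zlib_bits`, the dataset size, and the periodic labeling are assumptions chosen for illustration.

```python
import math
import random
import zlib


def direct_bits(n: int, num_classes: int) -> float:
    """Bits needed to write n labels with no compression: n * log2(C)."""
    return n * math.log2(num_classes)


def zlib_bits(labels: list[int]) -> int:
    """Size in bits of the zlib-compressed label sequence.

    This is only an upper-bound proxy for Kolmogorov complexity,
    assuming every label fits in one byte (0..255).
    """
    return 8 * len(zlib.compress(bytes(labels), 9))


n, num_classes = 10_000, 10
rng = random.Random(0)

# Uniformly sampled labels: the setting of no free lunch theorems.
random_labels = [rng.randrange(num_classes) for _ in range(n)]

# A highly structured (periodic) labeling, standing in for low-complexity data.
structured_labels = [i % num_classes for i in range(n)]

print("direct encoding bits:", direct_bits(n, num_classes))
print("zlib bits, random labels:", zlib_bits(random_labels))
print("zlib bits, structured labels:", zlib_bits(structured_labels))
```

Under these assumptions, the random labeling should not compress below roughly $n\log_2 C$ bits, while the periodic labeling shrinks to a small fraction of that size.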

Paper Structure (23 sections, 2 theorems, 10 equations, 8 figures, 2 tables)

Key Result

Theorem 3.1

Let $(X,Y)$ be a dataset with $n$ data points and uniformly sampled random labels from $C$ classes. Then, with probability at least $1-\delta$, for every classifier $p(y|x)$, where CE$(p)$ is the empirical cross entropy of the classifier $p(y|x)$ on the data. Thus for any model of bounded size, if the size of the dataset is large enough, the model cannot represent any classifier with cross entropy…
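
For intuition about why uniformly sampled labels resist compression, the standard counting argument can be written out as below. This is a sketch of a textbook fact about Kolmogorov complexity, not the paper's proof of Theorem 3.1; the integer $m$ and the conditional complexity $K(Y \mid X)$ are notation introduced here for illustration.

```latex
% There are fewer than 2^m binary programs shorter than m bits, and each
% program (given X) outputs at most one labeling Y. Hence, for any integer m,
\[
  \bigl|\{\, Y \in \{1,\dots,C\}^n : K(Y \mid X) < m \,\}\bigr| < 2^{m}.
\]
% With Y uniform over the C^n possible labelings, this gives
\[
  \Pr\bigl[\, K(Y \mid X) < m \,\bigr] < \frac{2^{m}}{C^{n}}
  = 2^{-(n\log_2 C - m)}.
\]
```

In words, the chance that random labels can be described in $k$ fewer bits than the direct encoding $n\log_2 C$ decays like $2^{-k}$, which is why virtually all uniformly sampled labelings have high complexity.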

Figures (8)

  • Figure 1: Over time, tasks that were performed by domain-specialized ML systems are increasingly performed by unified neural network architectures. Real-world datasets often exhibit low Kolmogorov complexity. A model that combines a flexible hypothesis space with a simplicity bias towards low Kolmogorov complexity will provide good generalization on many different problems and modalities of data.
  • Figure 2: (Left): Compressed sizes of tabular labels where compression is performed via a trained MLP model (as in the paper's subsection on neural network compressors) vs. direct encoding of labels ($n\log_2 C$). (Middle): Compression of image classification datasets using CNNs. Note the breakdown of the total compressed size of the labels into model fit (NLL Bits), compressed parameters (Model Bits), and architecture and decompressor (Code Bits). In both cases, models can greatly compress a diverse suite of datasets, highlighting a common structure shared by models and real-world data. (Right): Compression-based generalization bounds [kapoor2022] for CNNs on tabular data, fed in with each pixel representing a tabular feature. The bounds are able to explain the majority of the model performance as shown by the test error, indicating that even CNNs designed for computer vision have a generic inductive bias appropriate for a wide range of datasets containing no spatial structure at all.
  • Figure 3: GPT-3 prefers low-complexity sequences generated by expression trees. Left: Average log-probability of sequences by complexity. Right: Average log-probability by sequence length, restricted to decimal digit tokens. GPT-3 variants are ordered by increasing size. Observe that GPT-3 variants assign exponentially lower probabilities to higher-complexity sequences (left), as in the Solomonoff prior, and that bigger, more powerful models exhibit this behavior especially strongly. Moreover, the models become more confident as they see more tokens, and the more powerful GPT-3 variants such as Davinci learn faster (right).
  • Figure 4: A single learner, which is more expressive than a ViT but also prefers simple solutions representable by a GoogLeNet, can simultaneously solve small and large scale problems.
  • Figure 5: Randomly initialized GPT-2 Base prefers low-complexity sequences generated by bitstring repetition. Left: Average log-probability of sequences by complexity. Right: Average accuracy by complexity.
  • ...and 3 more figures
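
The bit accounting described in the Figure 2 caption (NLL Bits plus Model Bits plus Code Bits, compared against the direct encoding of $n\log_2 C$ bits) can be sketched as follows. The function names and every number in the example are hypothetical and are not taken from the paper's experiments; the sketch only shows how the three terms add up.

```python
import math


def direct_encoding_bits(n: int, num_classes: int) -> float:
    """Cost of writing the labels down directly: n * log2(C) bits."""
    return n * math.log2(num_classes)


def nll_bits(probs_of_true_label: list[float]) -> float:
    """Model-fit term: negative log-likelihood of the true labels, in bits."""
    return -sum(math.log2(p) for p in probs_of_true_label)


def total_compressed_bits(
    probs_of_true_label: list[float], model_bits: float, code_bits: float
) -> float:
    """Total label code length: NLL Bits + Model Bits + Code Bits."""
    return nll_bits(probs_of_true_label) + model_bits + code_bits


# Hypothetical example: 1000 points, 10 classes, and a classifier that
# assigns probability 0.9 to the correct label on every point.
n, num_classes = 1000, 10
probs = [0.9] * n
total = total_compressed_bits(probs, model_bits=500.0, code_bits=200.0)

print("direct encoding bits:", direct_encoding_bits(n, num_classes))
print("total compressed bits:", total)
```

A classifier that guesses uniformly pays exactly the direct-encoding cost in its NLL term alone, so compression is only achieved when the model's fit outweighs the bits spent describing its parameters and code.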

Theorems & Definitions (3)

  • Theorem 3.1
  • Theorem 3.1
  • proof