
A Compression Perspective on Simplicity Bias

Tom Marty, Eric Elmoznino, Leo Gagnon, Tejas Kasetty, Mizu Nishikawa-Toomey, Sarthak Mittal, Guillaume Lajoie, Dhanya Sridhar

Abstract

Deep neural networks exhibit a simplicity bias, a well-documented tendency to favor simple functions over complex ones. In this work, we cast new light on this phenomenon through the lens of the Minimum Description Length principle, formalizing supervised learning as a problem of optimal two-part lossless compression. Our theory explains how simplicity bias governs feature selection in neural networks through a fundamental trade-off between model complexity (the cost of describing the hypothesis) and predictive power (the cost of describing the data). Our framework predicts that as the amount of available training data increases, learners transition through qualitatively different features -- from simple spurious shortcuts to complex features -- only when the reduction in data encoding cost justifies the increased model complexity. Consequently, we identify distinct data regimes where increasing data promotes robustness by ruling out trivial shortcuts, and conversely, regimes where limiting data can act as a form of complexity-based regularization, preventing the learning of unreliable complex environmental cues. We validate our theory on a semi-synthetic benchmark showing that the feature selection of neural networks follows the same trajectory of solutions as optimal two-part compressors.
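The transitions described above can be illustrated with a minimal sketch of two-part MDL model selection. The candidate names and all costs below are hypothetical stand-ins: each candidate has a fixed model description cost and a per-sample data encoding cost (more predictive features encode the data more cheaply), and the MDL-optimal learner picks the candidate minimizing the total cost at each dataset size $N$, producing transitions as $N$ grows (cf. Figure 1b).

```python
# Hypothetical candidates: (model cost in bits, data encoding cost in bits/sample).
# A simple shortcut is cheap to describe but encodes the data less efficiently
# than a complex feature, which has a high up-front description cost.
candidates = {
    "trivial": (1.0, 1.00),
    "simple shortcut": (20.0, 0.40),
    "complex feature": (200.0, 0.05),
}

def mdl_optimal(n):
    """Return the candidate with the lowest two-part description length at size n."""
    return min(candidates, key=lambda name: candidates[name][0] + n * candidates[name][1])

# As n increases, the optimal solution transitions from trivial -> shortcut -> complex.
trajectory = [(n, mdl_optimal(n)) for n in (10, 100, 10_000)]
```

With these illustrative numbers, the learner uses the trivial solution at $N=10$, switches to the simple shortcut at $N=100$, and only adopts the complex feature by $N=10{,}000$, once the saved data encoding cost justifies its description cost.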



Figures (4)

  • Figure 1: (a) In a supervised setting, some features causally relate to the label (e.g., phenotypical characteristics), while others are spurious (e.g., background of the image). A Bayesian solution leverages all available features. (b) Total expected compression cost as a function of training dataset size $N$. Each line represents the two-part description length of a candidate model. The MDL-optimal learner selects the model achieving the lowest total cost at each $N$, inducing transitions between qualitatively different solutions as the amount of data increases.
  • Figure 2: Comparison of the compression view and the learning view for Scenario A and B. (a) Compression envelopes on a log-log scale: each line shows the total two-part description length of a candidate predictor as a function of dataset size $N$. (b) Accuracy on held-out datasets as a function of $N$: solid lines are empirical measurements, dashed lines are reported metrics of the MDL-optimal solution. (c) Permutation feature importance (accuracy drop after shuffling each feature): a higher drop indicates stronger reliance on that feature.
  • Figure 3: Empirical vs. theoretical transition point $N$ for varying task characteristics. Each point is one experimental configuration; the dashed line is the identity $N_{\text{theory}} = N_{\text{empirical}}$. (a) Reducing the predictiveness of a feature increases its compression rate disadvantage, causing the transition to occur earlier. (b) Increasing the complexity of a feature inflates its description cost, causing the transition to occur earlier. In both cases, the proximity to the diagonal indicates that our MDL-based theory accurately predicts the observed feature switch as measured by our proxy.
  • Figure C.1: Example samples from the EMNIST-based benchmark we curated, featuring the three types of features presented in \ref{subsec:robust}: the digit shape (robust feature), the color (simple spurious feature), and the watermark (complex environmental cue). The label is determined by the digit value, while the color and watermark are correlated with the label in an environment-dependent way.
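The permutation feature importance measure in Figure 2(c) can be sketched as follows. The data and the threshold "model" here are hypothetical stand-ins: one feature column at a time is shuffled to break its association with the label, and the resulting drop in accuracy indicates how strongly the predictor relies on that feature.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 0).astype(int)        # label depends only on feature 0

def predict(X):
    # Toy predictor that relies exclusively on feature 0.
    return (X[:, 0] > 0).astype(int)

def permutation_importance(X, y, predict):
    base = (predict(X) == y).mean()
    drops = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])   # break the feature-label link
        drops.append(base - (predict(Xp) == y).mean())
    return drops

drops = permutation_importance(X, y, predict)
# Feature 0 shows a large accuracy drop; feature 1 shows none.
```

A large drop for a given feature signals strong reliance on it, which is how Figure 2(c) diagnoses whether a network is using the robust feature, the spurious shortcut, or the environmental cue.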