Neural networks are a priori biased towards Boolean functions with low entropy
Chris Mingard, Joar Skalse, Guillermo Valle-Pérez, David Martínez-Rubio, Vladimir Mikulik, Ard A. Louis
TL;DR
This work quantifies the inductive bias of neural networks by studying the a priori distribution $P(f)$ over Boolean functions induced by random initialization. It proves a sharp result for a perceptron with no bias term: the number $T$ of inputs classified as 1 is uniformly distributed, with $P(T=t)=2^{-n}$ for $0\leq t < 2^n$. Because far fewer functions are expressible at extreme values of $t$ than at intermediate ones, this yields a strong bias toward individual low-entropy, and often simpler, functions. The analysis then extends to multi-layer ReLU networks, proving that adding layers strengthens this bias and showing empirically that increasing the variance of the bias term has a similar effect. The authors connect initialization bias to learning dynamics and generalization, provide expressivity bounds, and offer empirical evidence that the phenomenon persists in more realistic settings. Overall, the paper provides a mechanistic explanation for why highly overparameterized networks tend to favor simple functions and suggests architectural choices to tune inductive bias for better generalization.
Abstract
Understanding the inductive bias of neural networks is critical to explaining their ability to generalise. Here, for one of the simplest neural networks -- a single-layer perceptron with $n$ input neurons, one output neuron, and no threshold bias term -- we prove that upon random initialisation of weights, the a priori probability $P(t)$ that it represents a Boolean function that classifies $t$ points in $\{0,1\}^n$ as 1 has a remarkably simple form: $P(t) = 2^{-n}$ for $0\leq t < 2^n$. Since a perceptron can express far fewer Boolean functions with small or large values of $t$ (low entropy) than with intermediate values of $t$ (high entropy) there is, on average, a strong intrinsic a priori bias towards individual functions with low entropy. Furthermore, within a class of functions with fixed $t$, we often observe a further intrinsic bias towards functions of lower complexity. Finally, we prove that, regardless of the distribution of inputs, the bias towards low entropy becomes monotonically stronger upon adding ReLU layers, and empirically show that increasing the variance of the bias term has a similar effect.
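The uniform form of $P(t)$ can be checked numerically. The following is a minimal sketch, not the authors' code, under stated assumptions: weights are drawn i.i.d. from a standard Gaussian (one possible choice of initialisation), an input $x$ is classified as 1 when $w \cdot x > 0$, and the helper name `sample_t_counts`, the value `n = 4`, and the sample size are illustrative choices. Under these assumptions the empirical frequency of each $t$ should be close to $2^{-n}$.

```python
import itertools

import numpy as np


def sample_t_counts(n, num_samples, seed=0):
    """Histogram of t, the number of inputs in {0,1}^n classified as 1,
    over randomly initialised perceptrons with no bias term.

    Assumption: weights are i.i.d. standard Gaussian.
    """
    rng = np.random.default_rng(seed)
    # All 2^n Boolean input vectors, shape (2^n, n).
    inputs = np.array(list(itertools.product([0, 1], repeat=n)))
    # One weight vector per sampled perceptron, shape (num_samples, n).
    weights = rng.standard_normal((num_samples, n))
    # Output is 1 where w . x > 0; the all-zero input always maps to 0,
    # so t can never reach 2^n.
    outputs = weights @ inputs.T > 0
    t = outputs.sum(axis=1)
    return np.bincount(t, minlength=2**n)


n = 4
num_samples = 200_000
counts = sample_t_counts(n, num_samples)
freqs = counts / num_samples

print("expected frequency per t:", 2.0**-n)
for t_value, freq in enumerate(freqs):
    print(f"t = {t_value:2d}: {freq:.4f}")
```

Because the all-zero input always gives $w \cdot x = 0$, $t$ ranges over $0 \leq t < 2^n$, matching the range in the stated result. Small deviations from $2^{-n}$ in the printed frequencies are sampling noise.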
