Neural networks are a priori biased towards Boolean functions with low entropy

Chris Mingard, Joar Skalse, Guillermo Valle-Pérez, David Martínez-Rubio, Vladimir Mikulik, Ard A. Louis

TL;DR

This work quantifies the inductive bias of neural networks by studying the a priori distribution $P(f)$ over Boolean functions produced from random initializations. It proves a sharp result for a zero-bias perceptron: the count of positive outputs $T$ is uniformly distributed with $P(T=t)=2^{-n}$, which yields a strong bias toward low-entropy, and often simpler, functions; it then extends the analysis to multi-layer ReLU networks, showing that depth and bias terms amplify this bias. The authors connect initialization bias to learning dynamics and generalization, provide expressivity bounds, and offer empirical evidence that the phenomenon persists in more realistic settings. Overall, the paper provides a mechanistic explanation for why highly overparameterised networks tend to favor simple functions and suggests architectural choices to tune inductive bias for better generalization.

Abstract

Understanding the inductive bias of neural networks is critical to explaining their ability to generalise. Here, for one of the simplest neural networks -- a single-layer perceptron with $n$ input neurons, one output neuron, and no threshold bias term -- we prove that upon random initialisation of weights, the a priori probability $P(t)$ that it represents a Boolean function that classifies $t$ points in $\{0,1\}^n$ as 1 has a remarkably simple form: $P(t) = 2^{-n}$ for $0\leq t < 2^n$. Since a perceptron can express far fewer Boolean functions with small or large values of $t$ (low entropy) than with intermediate values of $t$ (high entropy) there is, on average, a strong intrinsic a-priori bias towards individual functions with low entropy. Furthermore, within a class of functions with fixed $t$, we often observe a further intrinsic bias towards functions of lower complexity. Finally, we prove that, regardless of the distribution of inputs, the bias towards low entropy becomes monotonically stronger upon adding ReLU layers, and empirically show that increasing the variance of the bias term has a similar effect.
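The counting argument in the abstract (few expressible functions at extreme $t$, many at intermediate $t$) can be illustrated with a small Monte Carlo sketch. The code below is not from the paper: it assumes the perceptron outputs 1 when $w \cdot x > 0$, samples Gaussian weights, and takes the entropy of a function to be the binary entropy of $t/2^n$. The function names and the choices $n=3$ and $10^5$ samples are illustrative only, and sampling can only lower-bound the number of expressible functions.

```python
import itertools
import math

import numpy as np


def binary_entropy(t, n):
    """Binary entropy (in bits) of the fraction t / 2^n of inputs mapped to 1."""
    p = t / 2**n
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def distinct_functions_per_t(n=3, samples=100_000, seed=0):
    """Count the distinct Boolean functions found by sampling, grouped by t."""
    rng = np.random.default_rng(seed)
    # All 2^n Boolean inputs as rows.
    inputs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    weights = rng.standard_normal((samples, n))
    # Assumed convention: output 1 when w . x > 0 (zero bias term).
    outputs = (weights @ inputs.T > 0).astype(np.int64)
    functions = np.unique(outputs, axis=0)  # one row per distinct function
    t_values = functions.sum(axis=1)
    return np.bincount(t_values, minlength=2**n)


n = 3
counts = distinct_functions_per_t(n=n)
for t, count in enumerate(counts):
    if count:
        # If P(t) = 2^-n, the average probability per function at this t is:
        average = 2.0**-n / count
        print(f"t={t} entropy={binary_entropy(t, n):.3f} "
              f"functions={count} avg P(f)={average:.4f}")
```

Under these assumptions, the functions at $t=0$ and $t=2^n-1$ each keep the full probability mass $2^{-n}$, while functions at intermediate $t$ have to share it, which is the sense in which individual low-entropy functions are favoured on average.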

Paper Structure

This paper contains 27 sections, 19 theorems, 55 equations, 16 figures.

Key Result

Theorem 4.1

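For a perceptron $f_{\theta}$ with $b=0$ and weights $w$ sampled from a distribution which is symmetric under reflections along the coordinate axes, the probability measure $P(\theta:\mathcal{T}(f_\theta)=t)$ is given by

$$P(\theta:\mathcal{T}(f_\theta)=t) = \begin{cases} 2^{-n} & \text{if } 0 \leq t < 2^n, \\ 0 & \text{otherwise.} \end{cases}$$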
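The uniform law $P(t) = 2^{-n}$ for $0 \leq t < 2^n$, as stated in the abstract for the zero-bias perceptron, can be checked numerically. The following is a minimal sketch rather than the authors' code: it assumes the output is 1 when $w \cdot x > 0$, uses standard Gaussian weights (one distribution that is symmetric under reflections along the coordinate axes), and picks $n=4$ and $2 \times 10^5$ samples purely for illustration.

```python
import itertools

import numpy as np


def count_positive_outputs(weights, inputs):
    """Return T for each weight vector: the number of inputs with w . x > 0."""
    return (weights @ inputs.T > 0).sum(axis=1)


def estimate_t_distribution(n=4, samples=200_000, seed=0):
    """Monte Carlo estimate of P(T = t) for a zero-bias perceptron."""
    rng = np.random.default_rng(seed)
    # All 2^n points of {0,1}^n as rows.
    inputs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    # Gaussian weights: an assumed example of a reflection-symmetric distribution.
    weights = rng.standard_normal((samples, n))
    t = count_positive_outputs(weights, inputs)
    # Index t runs over 0..2^n inclusive so that t = 2^n is reported explicitly.
    return np.bincount(t, minlength=2**n + 1) / samples


probs = estimate_t_distribution()
print("target 2^-n:", 2.0**-4)
print("estimates  :", np.round(probs, 4))
```

Because the all-zeros input always gives $w \cdot x = 0$, the value $t = 2^n$ never occurs under this convention, which matches the strict inequality $t < 2^n$ in the statement.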

Figures (16)

  • Figure 1: Probability $P(f)$ that a function obtains upon random choice of parameters versus Lempel-Ziv complexity $K_{LZ}(f)$ for (a) an $n=7$ perceptron with $b=0$ and weights sampled from a Gaussian distribution, (b) an $n=7$ perceptron with $b=0$ and weights sampled from a uniform distribution centred at $0$ and (c) a 1-hidden layer neural network (with 64 neurons in the hidden layer). Weights $w$ and the threshold bias terms are sampled from $\mathcal{N}(0,1)$. For all cases $10^8$ samples were taken and frequencies less than 2 were eliminated to reduce finite sampling effects. We present the graphs with the same scale for ease of comparison.
  • Figure 2: Probability $P(f)$ vs rank for functions for a perceptron with $n=7$, $\sigma_b=0$, and weights sampled from independent Gaussian distributions. In \ref{fig:rankplot47} and \ref{fig:rankplot64} the functions are ranked within their respective ${\mathbb{F}}_t$. The seven highest probability functions in \ref{fig:rankplot64} are $f=0101\dots$ and equivalent functions obtained by permuting the input dimensions -- note that these are very simple functions (simpler than the simplest functions that satisfy $\mathcal{T}(f)=47$).
  • Figure 3: Effect of adding a bias term sampled from $\mathcal{N}(0,\sigma_b)$ to a perceptron with weights sampled from $\mathcal{N}(0,1)$. (a) Increasing $\sigma_b$ strengthens the bias towards low entropy, with a particularly strong bias towards $t=0$ and $t=2^n$. (b) $P(t=0)$ increases with $\sigma_b$ and asymptotes to $1/2$ in the limit $\sigma_b \rightarrow \infty$.
  • Figure 4: $P(T=t)$ becomes on average more biased towards low entropy for an increasing number of layers or increasing $\sigma_b$. Here we use $n=7$ input neurons, with input $\{0,1\}^7$ (centered data) or $\{-1,1\}^7$ (uncentered data). The hidden layers are of width $2^{n-1} = 64$ to guarantee full expressivity. $\sigma_w=1.0$ in all cases. The insets show how $P(t=0)$ asymptotes to $\frac{1}{2}$ with increasing layers or $\sigma_b$.
  • Figure 5: Probability vs rank for functions (ranked by probability) from samples of size $10^8$, with input size $n=7$, and every weight and bias term sampled from $\mathcal{N}(0,1)$ unless otherwise specified, over initialisations of: (a) a perceptron with $b=0$; (b) a perceptron; (c) a one-hidden layer neural network (with 64 neurons in the hidden layer); (d) a perceptron with $b=0$ and weights sampled from identical centered uniform distributions (note how similar (a) is to (d)). We cut off frequencies less than $2$ to eliminate finite size effects. In (a) and (b) lines were fitted using least-squares regression; for (c) the line corresponding to the ansatz in \ref{eqn:N0} is plotted instead.
  • ...and 11 more figures
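
The trend described in the Figure 3 caption, where $P(t=0)$ grows with $\sigma_b$ and approaches $1/2$, can be reproduced in a rough simulation. This sketch is not the authors' experiment: it assumes the perceptron outputs 1 when $w \cdot x + b > 0$, treats $\sigma_b$ as the standard deviation of the bias term (the caption's $\mathcal{N}(0,\sigma_b)$ does not say whether it is a standard deviation or a variance), and uses $n=5$ with $10^5$ samples to keep the run short. All names are illustrative.

```python
import itertools

import numpy as np


def prob_all_zero(sigma_b, n=5, samples=100_000, seed=0):
    """Estimate P(t = 0) for a perceptron whose bias term has scale sigma_b."""
    rng = np.random.default_rng(seed)
    # All 2^n points of {0,1}^n as rows.
    inputs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    weights = rng.standard_normal((samples, n))
    # Assumption: sigma_b is the standard deviation of the bias term.
    bias = sigma_b * rng.standard_normal(samples)
    pre_activations = weights @ inputs.T + bias[:, None]
    # t counts the inputs classified as 1 (assumed convention: w . x + b > 0).
    t = (pre_activations > 0).sum(axis=1)
    return float(np.mean(t == 0))


sigmas = [0.0, 1.0, 10.0, 100.0]
estimates = [prob_all_zero(sigma) for sigma in sigmas]
for sigma, estimate in zip(sigmas, estimates):
    print(f"sigma_b={sigma:>6}: P(t=0) ~ {estimate:.4f}")
```

With $\sigma_b = 0$ the estimate should sit near $2^{-n}$, since $t=0$ then requires every weight to be non-positive. For large $\sigma_b$ the sign of $b$ alone decides the output on almost every input, which is why the value approaches $1/2$.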

Theorems & Definitions (45)

  • Definition 3.1: DNNs
  • Definition 3.2: Parameter-function map
  • Definition 3.3
  • Definition 3.4: ${\mathbb{F}}_t$ and $P(t)$
  • Definition 3.5
  • Definition 3.6
  • Theorem 4.1
  • Proof: Proof sketch
  • Lemma 5.1
  • Lemma 5.2
  • ...and 35 more