Deep-layered machines have a built-in Occam's razor

Thomas M. A. Fink

TL;DR

We give an exact theory for the distribution of outputs of a deep-layered machine in which every node is a Boolean function of all the nodes below it. The results suggest that deep-layered machines and other learning methodologies may be inherently biased towards simplicity in the models that they generate.

Abstract

Input-output maps are prevalent throughout science and technology. They are empirically observed to be biased towards simple outputs, but we don't understand why. To address this puzzle, we study the archetypal input-output map: a deep-layered machine in which every node is a Boolean function of all the nodes below it. We give an exact theory for the distribution of outputs, and we confirm our predictions through extensive computer experiments. As the network depth increases, the distribution becomes exponentially biased towards simple outputs. This suggests that deep-layered machines and other learning methodologies may be inherently biased towards simplicity in the models that they generate.
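
The smallest nontrivial case of the map described above can be enumerated directly. The following is a minimal sketch, assuming $k = 2$ arguments and depth $n = 1$, so that the machine is $f(g_1(a,b), g_2(a,b))$. The encoding of a Boolean function as a 4-bit integer and the helper names `evaluate` and `compose` are illustrative choices, not taken from the paper.

```python
from collections import Counter
from itertools import product

K = 2                      # number of arguments (assumed for this sketch)
ROWS = 2 ** K              # rows in a truth table (4)
FUNCS = range(2 ** ROWS)   # all 16 Boolean functions, encoded as 4-bit integers

def evaluate(table, x, y):
    """Value of the function encoded by `table` at arguments (x, y)."""
    return (table >> (2 * x + y)) & 1

def compose(f, g1, g2):
    """Truth table of f(g1(a, b), g2(a, b)), again as a 4-bit integer."""
    out = 0
    for a, b in product((0, 1), repeat=2):
        bit = evaluate(f, evaluate(g1, a, b), evaluate(g2, a, b))
        out |= bit << (2 * a + b)
    return out

# Enumerate every input (f, g1, g2) and record which output it produces.
counts = Counter(compose(f, g1, g2) for f, g1, g2 in product(FUNCS, repeat=3))
total = sum(counts.values())

# Group the per-output counts by Hamming weight w of the output truth table.
per_output = {}
for table, c in counts.items():
    w = bin(table).count("1")
    per_output.setdefault(w, set()).add(c)

for w in sorted(per_output):
    # Each set should hold a single value if equal-weight outputs are equiprobable.
    print(w, sorted(c / total for c in per_output[w]))
```

Running it lists, for each Hamming weight, the probability of a single output of that weight. The constant outputs (weights 0 and 4) should come out as the most probable, even at this shallow depth.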

Paper Structure

This paper contains 25 equations, 4 figures, 1 table.

Figures (4)

Figure 1: Deep-layered machines. In a network of $k$ arguments ($a$, $b, \ldots$), each logic depends on all $k$ of the arguments below it, each of which depends on the $k$ arguments below it, and so on, down to $n$ levels. Our goal is to determine the probability distribution of $f(a, b, \ldots)$ (the output) given a random assignment of logics to $f$; $g_1, g_2, \ldots$; and so on (the input).
Figure 2: Input-output map for two arguments. For $k = 2$ arguments and network depth $n = 1$, there are $16^3$ inputs but only 16 outputs. The outputs are indicated by gray level, from light to dark, running from false to true in the order given in Table \ref{k23Distribution} (top). The inputs are the choices of logics for $f$, $g_1$ and $g_2$ in $f(g_1(a,b),g_2(a,b))$. Each panel is a different choice of $f$, and within each panel are the $16 \times 16$ choices of $g_1$ and $g_2$.
Figure 3: Computer experiments confirm our theory. We compare our prediction of the probability $\mathbf{q}(n)$ of the output function (lines) with computer experiments (points), for various values of the number of arguments $k$ and the network depth $n$. The vertical axis shows the probability of an output with a given Hamming weight $w$, since outputs with the same $w$ have the same probability. In all cases, our experiments agree with our theory exactly or, when sampling, to within statistical significance. (A) For $k = 2$, we enumerated all of the input configurations up to network depth $n = 4$. As $n$ increases, the distribution of the output function flattens out and falls. But for true and false ($w = 4$ and $w = 0$), the probabilities approach $1/2$. (B) For $k = 3$, we show exact results for $n = 0$ and 1, and sample the inputs for $n = 2$ and 3. We show our $n = 4$ theory for comparison. (C) For $k = 4$, we show exact results for $n = 0$, and sample the inputs for $n = 1$ and 2. We also show our $n = 3$ and 4 theory.
Figure 4: Probability of an output versus its information content. The vertical axis is the probability of an output function and the horizontal axis is its information content given by eq. (\ref{InfoContent}). As the network depth $n$ increases, the logarithm of the probability distribution of the output rotates clockwise from horizontal to a nearly straight line with slope $-1$. On a slower time scale, the distribution also falls as the outputs true and false (not shown here) dominate. For clarity the rising curves are black and the falling curves are gray. The orange points are the third eigenvector of $\mathbf{A}$, which governs the shape of the bulk, translated from $\mathbf{q}$ to $\mathbf{p}$ via eq. (\ref{xzTranslation}); the line which they approach is $-I$. (A) For $k = 4$ arguments, we show the distribution for $n = 0$ and powers of 2. (B) For $k = 6$ arguments, we show it for $n = 0$ and powers of 3. (C) For $k = 8$, we show it for $n = 0$ and powers of 4.
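
When exhaustive enumeration is out of reach, the inputs can be sampled instead, in the spirit of the experiments summarized in Figure 3. The sketch below assumes the layered structure of Figure 1: $n$ levels of $k$ uniformly random logics, each reading all $k$ nodes below it, topped by a single random logic $f$. The function names, the default seed, and the sample size are hypothetical, and this is not the authors' code.

```python
import random
from collections import Counter
from itertools import product

def random_machine_output(k, n, rng):
    """Truth table (tuple of 2^k bits) of one randomly wired deep-layered machine."""
    rows = list(product((0, 1), repeat=k))        # all 2^k argument assignments
    index = {row: j for j, row in enumerate(rows)}
    # Level 0: node i simply returns argument i.
    nodes = [[row[i] for row in rows] for i in range(k)]

    def apply_random_logic(below):
        # Draw a uniformly random Boolean function of k inputs and apply it
        # to the k nodes below, row by row.
        table = [rng.randrange(2) for _ in rows]
        return [table[index[tuple(node[j] for node in below)]]
                for j in range(len(rows))]

    for _ in range(n):                            # n levels of k random logics
        nodes = [apply_random_logic(nodes) for _ in range(k)]
    return tuple(apply_random_logic(nodes))       # the top-level logic f

def weight_distribution(k, n, samples, seed=0):
    """Estimated probability that the output has Hamming weight w, for each w.

    This sums over all outputs of weight w, so divide by the number of such
    outputs to get the per-output probability.
    """
    rng = random.Random(seed)
    tally = Counter(sum(random_machine_output(k, n, rng)) for _ in range(samples))
    return {w: tally[w] / samples for w in range(2 ** k + 1)}

if __name__ == "__main__":
    for depth in (0, 1, 2, 4):
        print(depth, weight_distribution(k=3, n=depth, samples=20000))
```

If the sketch matches the model, the share of samples landing on the constant outputs ($w = 0$ and $w = 2^k$) should grow with depth, consistent with the trend described in Figure 3.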