Lecture notes: From Gaussian processes to feature learning

Moritz Helias; Javed Lindner; Lars Schutzeichel; Zohar Ringel

TL;DR

These notes present a Bayesian framework for understanding learning in deep and recurrent neural networks. They contrast two regimes: (i) the Gaussian-process (lazy-learning) limit, in which a wide network converges to a fixed kernel and its representations do not adapt to the data, and (ii) adaptive-kernel approaches, in which the kernel itself changes with the training data and thereby captures feature learning. The text develops the necessary probabilistic tools, derives the neural network Gaussian process (NNGP) limit for shallow and deep architectures, and extends it to recurrent networks, connecting Bayesian inference with Langevin training via Fokker-Planck dynamics. Central to the exposition are large-width (large-N) analyses, cumulants and moment-generating functions, and large deviation theory, which is used to obtain saddle-point kernels and predictor statistics. The notes also relate the adaptive-kernel approach to kernel-rescaling approaches, offering a field-theoretic view on inductive biases, generalization, and the impact of architecture on learning.

Abstract

These lecture notes develop the theory of learning in deep and recurrent neuronal networks from the point of view of Bayesian inference. The aim is to enable the reader to understand typical computations found in the literature in this field. Initial chapters develop the theoretical tools, such as probabilities, moment and cumulant-generating functions, and some notions of large deviation theory, as far as they are needed to understand collective network behavior with large numbers of parameters. The main part of the notes derives the theory of Bayesian inference for deep and recurrent networks, starting with the neural network Gaussian process (lazy-learning) limit, which is subsequently extended to study feature learning from the point of view of adaptive kernels. The notes also expose the link between the adaptive kernel approach and approaches of kernel rescaling.

Paper Structure (89 sections, 377 equations, 16 figures)

Introduction
Related works
Probabilities, moments, cumulants
Probabilities, observables, and moments
Transformation of random variables
Joint distribution and conditional distribution
Cumulants
Connection between moments and cumulants
Recovering the probability density
Keypoints
Gaussian distribution and Wick's theorem
Gaussian distribution
Moment and cumulant generating function of a Gaussian
Wick's theorem
Appendix: Self-adjoint operators
...and 74 more sections

Figures (16)

Figure 1: Linear regression in a Bayesian framework. Comparison between prior and posterior distributions for the linear model. Here the output $p(y|X,w)$ of the linear regression is assumed to be stochastic with Gaussian regularization noise: instead of \ref{eq:conditional_output_given_w} we use $p(y|X,w)=\mathcal{N}(y|w^{\mathrm{T}}x,\kappa\,\mathbb{I})$, which corresponds to adding Gaussian noise $\xi_{\alpha}\stackrel{\text{i.i.d.}}{\sim}\mathcal{N}(0,\kappa)$, i.e. $y_{\alpha}\to y_{\alpha}+\xi_{\alpha}$. This is often done as a means of regularization: it forces the outputs to be close to the training points, but allows for some wiggle room. a) Prior and posterior of labels $y_{\ast}$ shown as mean and standard deviation from \ref{eq:posterior_Gaussi}. The posterior is obtained by conditioning on the training labels $y_{\circ}$. b) Prior and posterior distributions of the slope of the linear model. c) Same as a) but for zero noise ($\kappa=0$). d) Same as b) but for zero noise ($\kappa=0$). (Adapted from Bachelor thesis by Bastian Epping, 2020.)
Figure 2: Sketch of a deep network with input $x$, $L+1$ hidden layers $h^{(0)},\ldots,h^{(L)}$ and a scalar output $y$.
Figure 3: Neural Network Gaussian Process (NNGP) for erf-activation function. Display of the diagonal $C_{\alpha\alpha}$ and off-diagonal $C_{\alpha\beta}$ elements of the NNGP kernel. a) Dependence of output variance $C_{\alpha\alpha}^{(a)}$ for different depths. b) Fixpoint values of variance for different hidden gain values $g$. c) Input weight variance initialization determined such that one obtains a fixed point $C_{\alpha\alpha}^{(a)}$ for different bias values. d) Output covariance $C_{\alpha\beta}^{(a)}$ as a function of input covariance $C_{\alpha\beta}^{(0)}$ with $g_{v}=0.1$. e) Output covariance $C_{\alpha\beta}^{(a)}$ as a function of input covariance $C_{\alpha\beta}^{(0)}$ with $g_{v}$ set so that $C_{\alpha\alpha}^{(a)}$ is initialized at the fixpoint. f) Same setting as in e) for different values of the bias variance $g_{b}$ and for $5$ layers. All results are produced for $\phi=\mathrm{erf}$ and with a regularization variance of $\kappa=1$.
Figure 4: Data samples and dot-product kernel for the Ising spin task \ref{subsec:Ising-spin-task}. a) Two data samples $x_{\alpha}$ for each class. Left two vectors: $z_{\alpha}=1$. Right two vectors: $z_{\alpha}=-1$. $p=0.7$. b) Two realizations of the corresponding dot-product kernels $K_{\alpha\beta}\in\mathbb{R}^{D\times D}$ showing a block-like structure. $D=40$. The covariance $\Sigma_{(\alpha\beta)(\alpha\delta)}$ captures the variability between specific entries (i.e. between $(\alpha\beta)$ and $(\alpha\delta)$) across different realizations of the kernel.
Figure 5: a) Recurrent network with input $x$ and a scalar output $y$, where the activity evolves in discrete time steps $t=0,\ldots,T$. b) Equivalent representation by “unrolling” time into $T+1$ hidden layers $h^{(0)},\ldots,h^{(T)}$: The recurrent network may be thought of as a deep network, where the single layer of neurons that is actually present is copied for each time step $t$ and connected to the layer in the next time step $t+1$ by the very same connectivity $W$ for all adjacent time steps. This “weight sharing” over layers will be the cause of correlated activity across layers.
...and 11 more figures
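
The setting described in the caption of Figure 1 can be illustrated with a minimal sketch for a one-dimensional input, so that the weight $w$ is a single slope. The sketch assumes a zero-mean Gaussian prior $w\sim\mathcal{N}(0,g)$; the prior variance `g`, the function names, and the toy data are illustrative assumptions and are not taken from the notes. With noise variance $\kappa$, the standard conjugate-Gaussian result gives posterior variance $g\kappa/(\kappa+g\sum_{\alpha}x_{\alpha}^{2})$ and posterior mean $g\sum_{\alpha}x_{\alpha}y_{\alpha}/(\kappa+g\sum_{\alpha}x_{\alpha}^{2})$; written in this form, the expressions remain valid at $\kappa=0$.

```python
import math


def posterior_slope(xs, ys, g=1.0, kappa=0.1):
    """Posterior mean and variance of the slope w for y = w * x + xi.

    Assumptions (illustrative, not from the notes): scalar inputs,
    prior w ~ N(0, g), i.i.d. noise xi ~ N(0, kappa) on each label.
    """
    sxx = sum(x * x for x in xs)               # sum of squared inputs
    sxy = sum(x * y for x, y in zip(xs, ys))   # input-label correlation
    denom = kappa + g * sxx
    mean = g * sxy / denom
    var = g * kappa / denom
    return mean, var


def predictive(x_star, xs, ys, g=1.0, kappa=0.1):
    """Posterior mean and standard deviation of the noise-free output w * x_star."""
    mean_w, var_w = posterior_slope(xs, ys, g, kappa)
    return mean_w * x_star, math.sqrt(var_w) * abs(x_star)


if __name__ == "__main__":
    xs, ys = [1.0, 2.0], [2.0, 4.0]            # toy training data
    print("prior slope (mean, var):", (0.0, 1.0))
    print("posterior slope, kappa=0.1:", posterior_slope(xs, ys, kappa=0.1))
    print("posterior slope, kappa=0:  ", posterior_slope(xs, ys, kappa=0.0))
    print("prediction at x*=3:", predictive(3.0, xs, ys, kappa=0.1))
```

For nonzero $\kappa$ the posterior over the slope is narrower than the prior but keeps a finite width. At $\kappa=0$ the posterior variance vanishes and the mean reduces to the least-squares slope, in line with the distinction the caption draws between the noisy and zero-noise panels.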
