Lecture notes: From Gaussian processes to feature learning
Moritz Helias, Javed Lindner, Lars Schutzeichel, Zohar Ringel
TL;DR
These notes present a Bayesian framework for understanding learning in deep and recurrent neural networks. They contrast two regimes: (i) the Gaussian-process (lazy-learning) limit, in which a wide network is described by a fixed kernel and its representations do not adapt to the data, and (ii) adaptive-kernel approaches, in which the kernel itself depends on the training data and thereby captures feature learning. The text develops the necessary probabilistic tools, derives the neural network Gaussian process (NNGP) limit for shallow and deep architectures, and extends it to recurrent networks, connecting Bayesian inference with Langevin training via Fokker-Planck dynamics. Central to the exposition are the large-width (large-N) analyses, the use of cumulants and moment-generating functions, and the application of large deviation theory to obtain saddle-point kernels and predictor statistics. The notes also relate kernel adaptation to kernel-rescaling approaches, offering a field-theoretic view on inductive biases, generalization, and the impact of architecture on learning dynamics.
Abstract
These lecture notes develop the theory of learning in deep and recurrent neuronal networks from the point of view of Bayesian inference. The aim is to enable the reader to understand typical computations found in the literature in this field. Initial chapters develop the theoretical tools, such as probabilities, moment- and cumulant-generating functions, and some notions of large deviation theory, as far as they are needed to understand collective network behavior with large numbers of parameters. The main part of the notes derives the theory of Bayesian inference for deep and recurrent networks, starting with the neural network Gaussian process (lazy-learning) limit, which is subsequently extended to study feature learning from the point of view of adaptive kernels. The notes also expose the link between the adaptive kernel approach and approaches of kernel rescaling.
