A Teacher-Student Perspective on the Dynamics of Learning Near the Optimal Point

Carlos Couto, José Mourão, Mário A. T. Figueiredo, Pedro Ribeiro

TL;DR

The paper investigates learning dynamics near the optimum in teacher–student two-layer networks by examining the Hessian eigenspectrum. It derives Hessian blocks, links to the Neural Tangent Kernel and Fisher information, and provides analytic eigenvalue distributions for linear networks (a convolution of scaled chi-square and Marchenko-Pastur laws). It shows that the Hessian rank equals the input dimension in the linear case, establishing an effective parameter count, and extends the analysis to quadratic and higher-order polynomial activations with bounds on effective parameters, while finding that the error-function activation yields a full-rank Hessian and no compression. Together, these results offer a principled notion of parameter efficiency across activations and illuminate how spectrum structure governs long-time learning dynamics and model compression in high dimensions.

Abstract

Near an optimal learning point of a neural network, the learning performance of gradient descent dynamics is dictated by the Hessian matrix of the loss function with respect to the network parameters. We characterize the Hessian eigenspectrum for some classes of teacher-student problems, when the teacher and student networks have matching weights, showing that the smaller eigenvalues of the Hessian determine long-time learning performance. For linear networks, we analytically establish that for large networks the spectrum asymptotically follows a convolution of a scaled chi-square distribution with a scaled Marchenko-Pastur distribution. We numerically analyse the Hessian spectrum for polynomial and other non-linear networks. Furthermore, we show that the rank of the Hessian matrix can be seen as an effective number of parameters for networks using polynomial activation functions. For a generic non-linear activation function, such as the error function, we empirically observe that the Hessian matrix is always full rank.
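The rank statement for linear networks can be checked numerically in a small case. The sketch below is not the paper's code. It assumes a scalar-output two-layer linear network f(x) = a^T W x, standard Gaussian inputs, and the population loss 0.5 * E[(f_student - f_teacher)^2] with no extra normalisation factors; the function name `linear_hessian_at_optimum` is hypothetical. Under these assumptions the residual vanishes at matching weights, so the Hessian reduces to J^T J, where J is the Jacobian of the effective weight vector v = W^T a with respect to all parameters.

```python
import numpy as np


def linear_hessian_at_optimum(W, a):
    """Hessian of 0.5 * E[(f_s - f_t)^2] at matching teacher/student weights.

    Assumed setup (not taken from the paper): f(x) = a^T W x, x ~ N(0, I).
    W has shape (n_hidden, n_input), a has shape (n_hidden,).
    """
    n_hidden, n_input = W.shape
    # The output depends on the parameters only through v = W^T a, since f(x) = v^T x.
    # dv_j / da_k = W[k, j]  ->  block W^T, shape (n_input, n_hidden)
    J_a = W.T
    # dv_j / dW[k, l] = a[k] * delta_{jl}, with W flattened row-major
    J_W = np.kron(a[None, :], np.eye(n_input))  # shape (n_input, n_hidden * n_input)
    J = np.hstack([J_a, J_W])
    # At zero residual and with E[x x^T] = I, the Hessian is J^T J.
    return J.T @ J


rng = np.random.default_rng(0)
n_input, n_hidden = 5, 8
W = rng.standard_normal((n_hidden, n_input))
a = rng.standard_normal(n_hidden)

H = linear_hessian_at_optimum(W, a)
n_params = n_hidden + n_hidden * n_input
rank = np.linalg.matrix_rank(H)
print(f"parameters: {n_params}, Hessian rank: {rank}, input dimension: {n_input}")

# The nonzero eigenvalues of J^T J coincide with those of J J^T = W^T W + |a|^2 I.
nonzero = np.sort(np.linalg.eigvalsh(H))[-n_input:]
reference = np.sort(np.linalg.eigvalsh(W.T @ W + (a @ a) * np.eye(n_input)))
print("nonzero spectrum matches W^T W + |a|^2 I:", np.allclose(nonzero, reference))
```

In this toy setup the nonzero eigenvalues are those of W^T W shifted by |a|^2, which is consistent with the abstract's description of a Marchenko-Pastur component combined with a chi-square component. The scalings used in the paper may differ from the ones assumed here.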

Paper Structure

This paper contains 27 sections, 65 equations, 9 figures.

Figures (9)

  • Figure 1: Depiction of the architecture of neural networks considered.
  • Figure 2: Agreement between the predicted eigenvalue distribution for the Hessian and numerical simulations.
  • Figure 3: Agreement between the loss function of a student network initialized near the optimum point and the exponential of the smallest eigenvalue.
  • Figure 4: Eigenspectrum distribution for a quadratic teacher-student setup with $\epsilon = 1$ and $N_i = N_h = N$.
  • Figure 5: Eigenspectrum distribution for an error-function teacher-student setup with $N_i = N_h = N$.
  • ...and 4 more figures