Asymptotics of Learning with Deep Structured (Random) Features

Dominik Schröder; Daniil Dmitriev; Hugo Cui; Bruno Loureiro

TL;DR

The work delivers a rigorous, high‑dimensional analysis of the test error for learning the readout with deep structured random features, expressing the error in terms of population covariances of the feature maps. It develops anisotropic deterministic equivalents via random matrix theory, and provides a closed-form recursion for the feature covariances in Gaussian rainbow networks, enabling practical predictions for misspecified, deep, structured features. The results connect deep random feature models to trained networks by showing how linearizing the network and studying the effective linear features can capture the observed learning curves in lazy regimes and even align with some real-data trends when covariances are data-driven. This framework offers a principled way to quantify inductive biases and generalization in deep architectures with structured randomness, with potential applications to model selection and understanding gradient-descent dynamics in high dimensions.

Abstract

For a large class of feature maps we provide a tight asymptotic characterisation of the test error associated with learning the readout layer, in the high-dimensional limit where the input dimension, hidden layer widths, and number of training samples are proportionally large. This characterization is formulated in terms of the population covariance of the features. Our work is partially motivated by the problem of learning with Gaussian rainbow neural networks, namely deep non-linear fully-connected networks with random but structured weights, whose row-wise covariances are further allowed to depend on the weights of previous layers. For such networks we also derive a closed-form formula for the feature covariance in terms of the weight matrices. We further find that in some cases our results can capture feature maps learned by deep, finite-width neural networks trained under gradient descent.
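The setting in the abstract, learning only the readout layer on top of a fixed deep random feature map, can be simulated numerically. The following is a minimal sketch, not the authors' code. It makes several simplifying assumptions that are not in the source: weights are isotropic i.i.d. Gaussian rather than structured, inputs are standard Gaussian, all widths equal the input dimension, and the ridge objective is unnormalized. The function names (`deep_features`, `ridge_readout`, `simulate_test_error`) are hypothetical.

```python
import numpy as np


def deep_features(X, weights):
    """Apply x -> tanh(W_L ... tanh(W_1 x)) to every row of X (shape n x d)."""
    H = X
    for W in weights:
        H = np.tanh(H @ W.T)
    return H


def ridge_readout(Phi, y, lam):
    """Ridge estimator for the readout layer.

    Minimizes ||y - Phi @ theta||^2 + lam * ||theta||^2 (this normalization
    is an assumption; the paper's convention may differ).
    """
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)


def simulate_test_error(d=100, n_train=300, n_test=2000, lam=1e-4, seed=0):
    """Empirical test error of a ridge readout on a three-layer tanh feature map."""
    rng = np.random.default_rng(seed)
    # Frozen student weights with i.i.d. N(0, 1/d) entries (isotropic simplification).
    weights = [rng.normal(scale=1.0 / np.sqrt(d), size=(d, d)) for _ in range(3)]
    # Target of the form theta_*^T tanh(W_* x), with random (assumed) parameters.
    W_star = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
    theta_star = rng.normal(size=d) / np.sqrt(d)

    def target(X):
        return np.tanh(X @ W_star.T) @ theta_star

    X_train = rng.normal(size=(n_train, d))
    X_test = rng.normal(size=(n_test, d))
    theta_hat = ridge_readout(deep_features(X_train, weights), target(X_train), lam)
    predictions = deep_features(X_test, weights) @ theta_hat
    # Mean squared error on fresh samples approximates the generalization error.
    return float(np.mean((predictions - target(X_test)) ** 2))


if __name__ == "__main__":
    for n in (100, 200, 400):
        print(n, simulate_test_error(n_train=n))
```

Sweeping `n_train` at fixed `d` traces an empirical learning curve of the kind the theoretical prediction is compared against; reproducing the structured setting would require replacing the isotropic weights with ones drawn from the prescribed covariances.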

Paper Structure (42 sections, 19 theorems, 160 equations, 7 figures)

Introduction
Code
Random features
Deep RFs
Setting
Test error of Lipschitz feature models
Proof of \ref{thm genRMT informal}
Population covariance for rainbow networks
Proof of \ref{theo lin}
Discussion of Theorem \ref{theo lin}
Linearizing trained neural networks
Concluding remarks
Real data
Limitations
Anisotropic asymptotic equivalents
...and 27 more sections

Key Result

Theorem 3.1

Under ass:labels, ass:data+features and ass:dimensions, for fixed $\lambda>0$ we have the asymptotics […] in the proportional $n\sim k\sim p$ regime, where […]. In the general case of comparable parameters we have the asymptotics with a worse error of […]. This allows one to identify the leading order of the generalization error as long as the ratio of the largest and smallest parameter is much smaller than the squ…

Figures (7)

Figure 1: Test error for a target $\theta_*^\top \tanh(W_* x)$, when learning with a four-layer Gaussian rainbow network with feature map $\varphi(x)=\tanh(W_3\tanh(W_2\tanh(W_1x)))$. All widths were taken equal to the input dimension $d$, and the regularization employed is $\lambda=10^{-4}$. The student weights are correlated across layers, with $W_1=W_2$, and the covariance $C_3$ of $W_3$ depending on $W_1$ as $C_3=(W_1W_1^\top+1/2\mathbb{I}_d)^{-1}$. Target/student correlations are also present, with $\check{C}_1=1/2\mathbb{I}_d$. The covariances $C_1,C_2,\tilde{C}_1$ were finally taken to have a spectrum with power-law decay, parametrized by $\gamma$. All details are provided in App. \ref{app: numerics}. Solid lines: theoretical prediction of Theorem \ref{thm genRMT informal}, in conjunction with the closed-form expression for the feature population covariance of Definition \ref{def: linearized_covs}. Circles: numerical simulations in $d=1000$.
Figure 2: Test error when training only the readout layer of a ReLU-activated three-layer neural network during training, using the TensorFlow implementation of the Adam optimizer (Kingma and Ba, 2014), over $120$ epochs with batch size $128$. (dashed): ridge regression. The data is sampled from a Gaussian distribution with mean and variance matching the distribution of MNIST images. In all training procedures, the regularization parameter has been numerically optimized. Solid lines represent the theoretical prediction of Theorem \ref{thm genRMT informal}, dots represent numerical experiments. For more details we refer to \ref{synth MNIST}.
Figure 3: Test error when re-training only the readout layer of an Adam-optimized ReLU-activated three-layer neural network, trained on a regression task on MNIST. Labels are $+1$ (resp. $-1$) for even (resp. odd) digits. Solid lines represent the theoretical prediction of Theorem \ref{thm genRMT informal}, dots represent numerical experiments on the real dataset. Different colors indicate different regularization strengths $\lambda$. Different panels correspond to different training times. All details are provided in App. \ref{app:real_data}.
Figure 4: Plot of $\mathcal{E}^\mathrm{rmt}_\mathrm{gen}$, $\mathcal{E}_\mathrm{gen}$ for various regularization parameters $\lambda$ and time steps $t$ (in "epoch.step" format). The horizontal lines represent the generalization error of the neural network, the curves $\mathcal{E}^\mathrm{rmt}_\mathrm{gen}$ and the dots $\mathcal{E}_\mathrm{gen}$. The last panel contains a linear regression model for the sake of comparison. Interestingly, in this particular case the random feature model already outperforms linear regression.
Figure 5: Dynamics of $\mathcal{E}^\mathrm{rmt}_\mathrm{gen}$ throughout the training
...and 2 more figures

Theorems & Definitions (42)

Remark 2.4
Definition 2.5: Deep structured feature model
Definition 2.6: Gaussian rainbow ensemble
Theorem 3.1
Remark 3.2: Relation to previous results
Definition 4.2: Linearized population covariances
Conjecture 4.3
Theorem 4.4
Remark 4.5: Comparison
Definition A.1: Lipschitz concentration
...and 32 more
