Random features and polynomial rules

Fabián Aguirre-López, Silvio Franz, Mauro Pastore

TL;DR

This work analyzes the generalization of random features models (RFMs) with Gaussian inputs by mapping the RFM to an equivalent polynomial model via a Hermite expansion of the activation. Using replica methods, the authors derive replica-symmetric saddle-point equations that reveal how learning progresses in hierarchical feature orders as $N \sim D^L$ and $P \sim D^K$, producing staircase generalization curves and an interpolation peak when $N \approx P$. The key insight is that the RFM behaves as a high-rank kernel machine for $P \ll N$ and as a degree-$L$ polynomial student for $P, N$ scaling with $D$, with higher-order Hermite components acting as noise that drives the interpolation peak. The authors validate their analytic predictions with numerical experiments across broad parameter regimes and provide a finite-size effective theory that captures realistic network sizes. These results deepen understanding of feature learning and generalization in RFMs beyond traditional proportional-scaling limits, with implications for kernel methods and lazy training regimes.

Abstract

Random features models play a distinguished role in the theory of deep learning, describing the behavior of neural networks close to their infinite-width limit. In this work, we present a thorough analysis of the generalization performance of random features models for generic supervised learning problems with Gaussian data. Our approach, built with tools from the statistical mechanics of disordered systems, maps the random features model to an equivalent polynomial model, and allows us to plot average generalization curves as functions of the two main control parameters of the problem: the number of random features $N$ and the size $P$ of the training set, both assumed to scale as powers of the input dimension $D$. Our results extend the case of proportional scaling between $N$, $P$ and $D$. They are in accordance with rigorous bounds known for certain particular learning tasks and are in quantitative agreement with numerical experiments performed over many orders of magnitude of $N$ and $P$. We also find good agreement far from the asymptotic limits where $D\to \infty$ and at least one of $P/D^K$ and $N/D^L$ remains finite.
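
As a concrete illustration of the setup described in the abstract, the following is a minimal NumPy sketch of a random features model trained by ridge regression on Gaussian inputs. It is not the authors' code: the linear sign teacher, the $1/\sqrt{D}$ scaling of the pre-activations, the ELU with unit slope parameter, the ridge normalization, and all names (`rfm_generalization_error`, `zeta`, `n_test`) are assumptions made for illustration only.

```python
import numpy as np


def elu(z):
    # Standard ELU with alpha = 1 (assumed; the paper's own definition is not reproduced here).
    return np.where(z > 0, z, np.expm1(np.minimum(z, 0.0)))


def rfm_generalization_error(D, N, P, zeta=1e-4, n_test=2000, seed=0):
    """Test misclassification rate of a random features model (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    F = rng.standard_normal((N, D))   # fixed random first-layer weights
    w = rng.standard_normal(D)        # hypothetical linear teacher vector

    def features(X):
        # N random features of each input row; the 1/sqrt(D) scaling is an assumption.
        return elu(X @ F.T / np.sqrt(D))

    def teacher(X):
        # Binary labels in {-1, +1} from a linear rule (an assumed, simplest-case teacher).
        return np.sign(X @ w)

    X_train = rng.standard_normal((P, D))       # Gaussian training inputs
    X_test = rng.standard_normal((n_test, D))   # Gaussian test inputs

    Phi = features(X_train)
    y = teacher(X_train)

    # Ridge regression on the second-layer weights only; zeta is the regularization strength.
    A = Phi.T @ Phi + zeta * np.eye(N)
    a = np.linalg.solve(A, Phi.T @ y)

    y_pred = np.sign(features(X_test) @ a)
    return float(np.mean(y_pred != teacher(X_test)))


if __name__ == "__main__":
    for P in (50, 200, 1000):
        print(P, rfm_generalization_error(D=20, N=200, P=P))
```

Sweeping `P` at fixed `N` (or `N` at fixed `P`) and averaging over several seeds gives empirical curves of the kind the theory is compared against; the sizes used here are far smaller than those in the paper's experiments.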

Paper Structure (28 sections, 118 equations, 5 figures, 1 table)


Figures (5)

  • Figure 1: Left: generalization error of the RFM on a classification task, as a function of the size of the training set $P$, for $D=30$, $N=10^{4}$, weights regularization $\zeta=10^{-8}$, quadratic teacher (balanced: $\tau_1=\tau_2 = 1/\sqrt{2}$, $\tau_{\ell>2}=0$) and ELU activation functions (defined in Eq. \ref{eq:ELU} below); the continuous line is the equivalent polynomial theory devised in Sec. \ref{sec:kernel}, truncated at $L=3$; dashed lines are the asymptotic theories (see Sec. \ref{sec:asympt} for details) for $N\to\infty$ and $P/D$ finite (red), $N\to\infty$ and $P/\binom{D}{2}$ finite (yellow), $N\to\infty$ and $P/\binom{D}{3}$ finite (blue), $P/\binom{D}{3}$ and $N/P$ finite (green); black points are results from numerical experiments averaged over 50 instances (see Appendix \ref{app:numerics}). The model learns the linear features (first step at $P\sim O( D )$), then learns the quadratic features (second step at $P\sim O( D^2 )$), then follows the interpolation peak at $P\sim N$. Right: numerical and theoretical teacher-student overlaps -- defined in Eq. \ref{eq:def-orderpars} and \ref{eq:RS} -- of the linear and quadratic features (the overlap of the cubic features is identically 0 by definition); the parameters of the model are the same as for the left panel.
  • Figure 2: Generalization error of a RFM on a classification task, as a function of the number of hidden units $N$, for $P=10^4$ and the rest of the parameters as in Fig. \ref{fig:T2}; continuous lines are the theories truncated at $L'=1,2,3$ (respectively: blue, yellow, red); numerical points (in black) interpolate between these curves in the regimes where $N\sim O( D ),O( D^2 ),O( D^3 )$, validating Eq. \ref{eq:Ktruncated}, where the truncation $L'$ of the equivalent polynomial theory is fixed at $L \sim \log(N)/\log(D)$.
  • Figure 3: Left: generalization error of the RFM on a classification task, as a function of the size of the training set $P$, for $D=30$, $N=10^{4}$, weights regularization $\zeta=10^{-8}$, linear teacher ($\tau_1= 1$, $\tau_{\ell>1}=0$) and ELU activation functions; the continuous line is the mean-field theory truncated at $L=3$; dashed lines are the asymptotic theories for $P/D$ finite and $L>1$ (red), $P/\binom{D}{2}$ finite and $L>2$ (yellow), $P/\binom{D}{3}$ finite and $L>3$ (blue), $P/\binom{D}{3}$ finite and $L=3$ (green); black points are results from numerical experiments averaged over 50 instances (see Appendix \ref{app:numerics}). The model learns the linear features (first step at $P\sim O( D )$), then overfits the quadratic features before learning they are zero (peak at $P\sim O( D^2 )$), then follows the interpolation peak at $P\sim N$. Notice how the agreement between the mean-field theory and the experiment is only qualitative around the last peak. Right: generalization error on classification for a linear teacher, as a function of the number of random features $N$, for different amounts of data $P$ ($D=30$, $\zeta=10^{-4}$, see Appendix \ref{app:numerics}). The optimal number of hidden units, for which $\epsilon_g$ is minimal, shifts from overparametrization to underparametrization, as is visible in the curves for $P=40$ and $P=200,400$. At a fixed value of $N$, more data does not always mean better generalization: after the interpolation peak, the order between the red ($P=400$) and yellow ($P=200$) curves is reversed (a point of view complementary to the plot in the left panel, where, at fixed $N$, the error can increase with $P$). The curves as functions of $N$ are obtained by gluing together the theories truncated at the corresponding $L$.
  • Figure 4: Generalization error vs $P$ ($D=30$, $N=10^4$) on classification for a purely cubic teacher ($\tau_3 =1$); in blue, polynomial theory and numerical experiments for the ReLU activation function (Eq. \ref{eq:ReLU}): in this case, $\mu_3 =0$ and the model cannot learn the cubic features, so the error remains $1/2$; in yellow and red (respectively, for $\zeta=10^{-4},10^{-8}$), the case of ELU (Eq. \ref{eq:ELU}), for which $\mu_3 \neq 0$ and the model can learn the cubic features.
  • Figure 5: Top row -- empirical (30 instances, $D=20$, $N=D^3$) vs. analytical (MP) distributions of the non-zero eigenvalues of the matrices defined in Sec. \ref{app:Cl}: $C^{(1,1)}$ (left), $C^{(2,1)}/D$, $C^{(2,2)}$ (center), $C^{(3,1)}/D^2$, $3 C^{(3,2)}/D$, $C^{(3,3)}$ (right). Bottom row -- comparison of the analytical curves with the empirical distribution (notice the log scale on the axes) of $C^{\odot 2}$ (left), $C^{\odot 3}$ (center) and $C^{\odot 1}+ C^{\odot 2} + C^{\odot 3}$ (right); analytical curves in the bottom row are rescaled in such a way that the sum of the densities in each panel is normalized.
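
The statement in the caption of Figure 4 that $\mu_3 = 0$ for ReLU but $\mu_3 \neq 0$ for ELU can be checked numerically. The sketch below computes Hermite coefficients by Gauss-Hermite quadrature, under stated assumptions: the coefficient is taken as the unnormalized expectation $\mathbb{E}_{z\sim\mathcal{N}(0,1)}[\sigma(z)\,\mathrm{He}_\ell(z)]$ with probabilists' Hermite polynomials, and ELU is taken in its standard form with unit slope parameter. The paper's own normalization of $\mu_\ell$ and its ELU definition may differ by constant factors, which does not affect whether a coefficient vanishes. The function name `hermite_coefficient` is hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval


def relu(z):
    return np.maximum(z, 0.0)


def elu(z):
    # Standard ELU with alpha = 1 (assumed form).
    return np.where(z > 0, z, np.expm1(np.minimum(z, 0.0)))


def hermite_coefficient(activation, order, n_nodes=100):
    """Approximate E[activation(z) * He_order(z)] for z ~ N(0, 1).

    Uses Gauss-Hermite quadrature for the weight exp(-x^2 / 2); the weights
    sum to sqrt(2 * pi), so dividing by that turns the sum into an expectation.
    """
    x, w = hermegauss(n_nodes)
    coeffs = np.zeros(order + 1)
    coeffs[order] = 1.0                 # select the single polynomial He_order
    he = hermeval(x, coeffs)
    return float(np.sum(w * activation(x) * he) / np.sqrt(2.0 * np.pi))


if __name__ == "__main__":
    for name, act in (("ReLU", relu), ("ELU", elu)):
        print(name, [hermite_coefficient(act, ell) for ell in range(4)])
```

Under these conventions the third coefficient of ReLU vanishes up to rounding, because ReLU is the sum of a linear function and an even function, while that of ELU is about $-0.14$. This is consistent with the caption: a model with ReLU features has no cubic component with which to fit a purely cubic teacher.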