Precise asymptotic analysis of Sobolev training for random feature models

Katharine E Fisher, Matthew TC Li, Youssef Marzouk, Timo Schorlepp

TL;DR

This paper provides a precise asymptotic analysis of Sobolev training for random feature models in the proportional regime by marrying the replica method with operator-valued free probability. Conditioning on a gradient-alignment variable, the authors derive a low-dimensional fixed-point system that characterizes both training and generalization errors under subspace Sobolev loss, revealing that gradient data shift the interpolation threshold to $p=(k+1)n$ and do not universally improve function generalization. The results show that Sobolev training can help gradient prediction mainly in underparameterized settings, while in highly overparameterized regimes, plain $L^2$ training remains competitive or superior for function prediction. The framework also quantifies how observational noise and gradient-projection cost influence the benefits of gradient information, offering practical guidance on when derivative information should be used. Overall, this work advances theoretical understanding of derivative-informed learning in high-dimensional neural models and provides a robust methodology for analyzing Sobolev-type losses in RF-like architectures.

Abstract

Gradient information is widely useful and available in applications, and is therefore natural to include in the training of neural networks. Yet little is known theoretically about the impact of Sobolev training -- regression with both function and gradient data -- on the generalization error of highly overparameterized predictive models in high dimensions. In this paper, we obtain a precise characterization of this training modality for random feature (RF) models in the limit where the number of trainable parameters, input dimensions, and training data tend proportionally to infinity. Our model for Sobolev training reflects practical implementations by sketching gradient data onto finite dimensional subspaces. By combining the replica method from statistical physics with linearizations in operator-valued free probability theory, we derive a closed-form description for the generalization errors of the trained RF models. For target functions described by single-index models, we demonstrate that supplementing function data with additional gradient data does not universally improve predictive performance. Rather, the degree of overparameterization should inform the choice of training method. More broadly, our results identify settings where models perform optimally by interpolating noisy function and gradient data.
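As a rough illustration of the training modality described in the abstract, the following minimal sketch fits the readout weights of a random feature model by ridge regression on stacked function and sketched-gradient data. It is not the authors' implementation: the normalization $f(x) = w^\top \sigma(Fx/\sqrt{d})/\sqrt{p}$, the Gaussian inputs, the choices $\sigma = \tanh$ and $\phi = \arctan$, the problem sizes, and the names `sobolev_design`, `F`, `V`, `tau`, and `lam` are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p, k = 30, 15, 40, 1          # samples, input dim, features, gradient sketches
tau, lam = 1.0, 1e-3                # gradient-term weight and ridge strength (assumed values)

# Assumed data model: Gaussian inputs, unit-norm index vector, orthonormal sketch.
X = rng.standard_normal((n, d))
theta0 = rng.standard_normal(d)
theta0 /= np.linalg.norm(theta0)
V, _ = np.linalg.qr(rng.standard_normal((d, k)))   # (d, k) with orthonormal columns
F = rng.standard_normal((p, d))                    # frozen random first-layer weights

# Assumed ridge function and activation, with their derivatives.
phi, dphi = np.arctan, lambda t: 1.0 / (1.0 + t**2)
act, dact = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2


def sobolev_design(X, F, V, tau):
    """Stack function rows and sketched-gradient rows, both linear in the readout w.

    Assumed model: f(x) = w @ act(F @ x / sqrt(d)) / sqrt(p).
    """
    n, d = X.shape
    p, k = F.shape[0], V.shape[1]
    Z = X @ F.T / np.sqrt(d)                        # (n, p) pre-activations
    A_fun = act(Z) / np.sqrt(p)                     # rows give f(x_i) as a function of w
    FV = F @ V / np.sqrt(d)                         # (p, k) sketched first-layer weights
    # Entry [i, a, j] is the coefficient of w_j in the a-th component of V^T grad f(x_i).
    A_grad = dact(Z)[:, None, :] * FV.T[None, :, :] / np.sqrt(p)
    return np.vstack([A_fun, np.sqrt(tau) * A_grad.reshape(n * k, p)])


# Noiseless single-index targets: y = phi(theta0 . x), V^T grad y = (V^T theta0) phi'(theta0 . x).
t = X @ theta0
y = phi(t)
G = dphi(t)[:, None] * (V.T @ theta0)[None, :]      # (n, k) sketched gradient data
b = np.concatenate([y, np.sqrt(tau) * G.ravel()])

A = sobolev_design(X, F, V, tau)
# Minimizer of ||A w - b||^2 + lam * ||w||^2 via the normal equations.
w_hat = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ b)
```

In this sketch the stacked system has $(k+1)n$ equations in $p$ unknowns, which is consistent with the interpolation threshold at $p=(k+1)n$ mentioned in the TL;DR. Setting `tau = 0` zeroes out the gradient rows and reduces the fit to plain least squares on function data.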

Paper Structure

This paper contains 65 sections, 5 theorems, 255 equations, 23 figures, 4 tables.

Key Result

Lemma D.1

For all symmetric $A, B, C \in \mathbb{R}^{p \times p}$ we have the identity $\mathop{\mathrm{tr}}\nolimits( (A \odot B) C) = \mathop{\mathrm{tr}}\nolimits( (B \odot C) A)$.
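The identity follows from writing both traces entrywise: $\mathrm{tr}((A \odot B) C) = \sum_{i,j} A_{ij} B_{ij} C_{ji}$ and $\mathrm{tr}((B \odot C) A) = \sum_{i,j} B_{ij} C_{ij} A_{ji}$, which agree once $A$ and $C$ are symmetric. A quick numerical sanity check, with helper names `random_symmetric` and `hadamard_trace_gap` that are hypothetical and chosen only for this example, might look like:

```python
import numpy as np


def random_symmetric(p, rng):
    """Draw a random symmetric p x p matrix."""
    M = rng.standard_normal((p, p))
    return (M + M.T) / 2


def hadamard_trace_gap(A, B, C):
    """Absolute difference between tr((A * B) C) and tr((B * C) A), with * the Hadamard product."""
    lhs = np.trace((A * B) @ C)
    rhs = np.trace((B * C) @ A)
    return abs(lhs - rhs)


rng = np.random.default_rng(0)
A, B, C = (random_symmetric(6, rng) for _ in range(3))
gap = hadamard_trace_gap(A, B, C)   # zero up to floating-point rounding
```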

Figures (23)

  • Figure 1: Illustration of the single hidden-layer RF model (left column) and its generalization performance in the high-dimensional limit (right column). In the right subfigures, lines correspond to theoretical predictions, while squares and circles show the mean over 1000 Monte Carlo samples in dimension $d=100$ (with error bars at $25\%$ and $75\%$ quantiles of the data). The horizontal axis is $p/n$. The dashed lines and squares correspond to least squares minimization of the readout weights $w$ using only function data, while solid lines and circles indicate Sobolev training where additional gradient information is used. Shaded regions cover the predicted $25\%$ and $75\%$ quantiles, while thick lines represent the mean. $L^2$ error (top right) refers to the mismatch in predicted function values, while the $H^1_k$ semi-norm error (bottom right) is the gradient mismatch when projected onto the $k$-dimensional subspace used for training. Numerical details (cf. Section \ref{sec:setup}): regularization $\lambda = 0.001$, no observational noise, ridge function $\phi(\omega) = \arctan(\omega) + 1 / \cosh(\omega)$, activation function $\sigma = \text{ReLU}$, $k =1$ gradient sketches, $\tau = 1$ gradient term weight, $n/d = 2.345$ number of samples per dimension.
  • Figure 2: Comparison of MC samples of \ref{eq:training-problem} to evaluate \ref{eq:gen-error-def} and \ref{eq:overlap-def} at $p/n=0.5$ and $n/d=2.345$ in finite dimensions $d \in \{200, 500, 1000, 2000\}$, against theoretical predictions \ref{eq:l2_gen_error_varpi}, \ref{eq:sobo_gen_error_varpi}, and \ref{eq:asymptotic_saddle_final_hat}. Other parameters: $\sigma=\mathrm{erf}$, $\phi=\arctan$, $k=1$. Left column: distribution of $L^2$ and $H^1_k$ generalization errors as a function of $\varpi = V_k^\top \theta_0$. Center and right columns: marginal distributions of the $(f_a, f_b)$ and $(q_a, q_c)$ overlap parameters.
  • Figure 3: Comparison of expected $L^2$ generalization error (left three columns) and $H^1_k$ generalization error (right three columns) of $L^2$ training ($\tau = 0$) and Sobolev training ($\tau = 1$) for $k = 1$ gradient projections as a function of the number of training samples $n$ and network features $p$, normalized by the dimension $d$. Rows correspond to different regularization strengths $\lambda \in \{10^{-1}, 10^{-4} \}$ and observational noise levels $\Delta^2 \in \{0, 4\}$ for $y_i$ and $V_k^\top y_i'$. Other parameters: $\sigma = \text{ReLU}$, $\phi = \arctan + 1 / \cosh$. All plots use the expected errors $\mathbb{E} [\varepsilon_{\text{gen}}^{L^2} ]$ and $\mathbb{E} [\varepsilon_{\text{gen}}^{H^1_k} ]$ over the alignment $\varpi \sim {\cal N}(0,1)$ in the limit \ref{eq:prop-asymp-def}, as predicted from the theory presented in Section \ref{sec:main-theory-res}. The errors themselves are shown on a logarithmic color scale while their relative difference is shown on a linear color scale that is symmetric around zero. Negative relative differences, shown in blue in the third and sixth column, indicate regimes (delimited by the black dashed lines) where Sobolev training outperforms $L^2$ training.
  • Figure 4: Continuous part of the empirical spectral densities for one sample of the feature matrix $K$, defined in \ref{eq:K-matrix-def}, at different numbers of features $p/d$. We compare standard $L^2$ training ($\tau = 0$, dashed blue lines) to Sobolev training ($\tau = 1$, $k = 1$, solid red lines). Other parameters are: $n/d = 5$, $d = 1000$, $\sigma = \text{ReLU}$. The spectral gap to $0$ closes at $p = n$ for $L^2$-training and at $p = 2n$ for Sobolev training with $k = 1$.
  • Figure 5: Error against ground truth achieved by $L^2$ training (first row) and Sobolev training (second row) on unseen test cases given a range of noise levels in the training data: $\Delta\in\{0.0,0.2,0.4,0.6,0.8,1.0\}$. Left column: the $L^2$ error of network predictions against $\phi(\theta_0^\top x)$, averaged over $x$. A lower bound to accuracy is given by the gray dotted line which marks the magnitude of the nonlinear component of $\phi$. The distributions predicted by Sobolev training are induced by $\varpi=V_k^\top \theta_0$, and ribbons shade between the $20\%$ and $80\%$ quantiles. Right column: the $H_k^1$ error found by averaging the squared difference between the network gradient predictions and $V_k^\top \theta_0 \phi'(\theta_0^\top x)$ over $x$ when $k=1$. The ribbons cover the range between the $50\%$ and $75\%$ quantiles of the $\chi^2$ distribution resulting from the random gradient projection. Parameters: $n/d=2.345$, $\lambda=10^{-6}$, $\phi(\omega) = \omega / 2 - \exp \{-\omega^2/2\}$, and $\sigma=\textrm{SiLU}$.
  • ...and 18 more figures

Theorems & Definitions (31)

  • Remark 1.1
  • Remark 2.1
  • Remark 2.2
  • Remark 2.3
  • Lemma D.1
  • proof
  • Definition G.1
  • Definition G.2
  • Definition G.3
  • Example G.1
  • ...and 21 more