Precise asymptotic analysis of Sobolev training for random feature models
Katharine E Fisher, Matthew TC Li, Youssef Marzouk, Timo Schorlepp
TL;DR
This paper gives a precise asymptotic analysis of Sobolev training for random feature (RF) models in the proportional regime by combining the replica method with operator-valued free probability. Conditioning on a gradient-alignment variable, the authors derive a low-dimensional fixed-point system that characterizes both training and generalization errors under a subspace Sobolev loss. The analysis shows that gradient data shift the interpolation threshold to $p=(k+1)n$ and do not universally improve function generalization. Sobolev training can help gradient prediction mainly in underparameterized settings, while in highly overparameterized regimes plain $L^2$ training remains competitive or superior for function prediction. The framework also quantifies how observational noise and gradient-projection cost influence the benefits of gradient information, offering practical guidance on when derivative information should be used. Overall, the work advances the theoretical understanding of derivative-informed learning in high-dimensional models and provides a methodology for analyzing Sobolev-type losses in RF-like architectures.
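The threshold $p=(k+1)n$ is consistent with a simple count of constraints. As a sketch, read $n$ as the number of training points, $p$ as the number of trainable weights, and $k$ as the dimension of the subspace onto which each gradient is projected; this reading of the symbols is an assumption inferred from the summary, not a definition quoted from the paper. Assuming further that only the linear readout of the RF model is trained, each training point contributes one function value and $k$ projected gradient components, all linear in the trainable weights:

$$
\underbrace{n}_{\text{function values}} \;+\; \underbrace{k\,n}_{\text{projected gradient components}} \;=\; (k+1)\,n \quad \text{linear constraints on } p \text{ weights},
$$

so exact interpolation of the combined function and gradient data becomes generically possible once $p \ge (k+1)n$, compared with $p \ge n$ when only function values are fit.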
Abstract
Gradient information is widely useful and available in applications, and is therefore natural to include in the training of neural networks. Yet little is known theoretically about the impact of Sobolev training -- regression with both function and gradient data -- on the generalization error of highly overparameterized predictive models in high dimensions. In this paper, we obtain a precise characterization of this training modality for random feature (RF) models in the limit where the number of trainable parameters, the input dimension, and the number of training data tend proportionally to infinity. Our model for Sobolev training reflects practical implementations by sketching gradient data onto finite-dimensional subspaces. By combining the replica method from statistical physics with linearizations in operator-valued free probability theory, we derive a closed-form description for the generalization errors of the trained RF models. For target functions described by single-index models, we demonstrate that supplementing function data with additional gradient data does not universally improve predictive performance. Rather, the degree of overparameterization should inform the choice of training method. More broadly, our results identify settings where models perform optimally by interpolating noisy function and gradient data.
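To make the training setup concrete, the following is a minimal NumPy sketch of Sobolev regression for an RF model with gradients sketched onto a $k$-dimensional subspace. It is an illustration under stated assumptions, not the authors' implementation: the `tanh` activation, the $1/\sqrt{d}$ and $1/\sqrt{p}$ scalings, the single fixed orthonormal sketch `V`, the unweighted sum of function and gradient residuals, the noise-free `sin` single-index target, and all function names (`sobolev_design`, `sobolev_targets`, `train_mse`) are hypothetical choices made here for illustration.

```python
import numpy as np

def sobolev_design(X, W, V):
    """Stack function and projected-gradient features of an RF model.

    Assumed model: f(x) = sum_j a_j * tanh(w_j . x / sqrt(d)) / sqrt(p),
    with only the readout weights a trained. Both f(x_i) and V^T grad f(x_i)
    are linear in a, so Sobolev training reduces to least squares.
    """
    n, d = X.shape
    p, k = W.shape[0], V.shape[1]
    Z = X @ W.T / np.sqrt(d)                      # (n, p) pre-activations
    Phi = np.tanh(Z) / np.sqrt(p)                 # function features
    dPhi = (1.0 - np.tanh(Z) ** 2) / np.sqrt(p)   # activation derivative
    WV = W @ V / np.sqrt(d)                       # (p, k) sketched weights
    # G[i, l, j] = d/dx of feature j at x_i, projected on column l of V
    G = dPhi[:, None, :] * WV.T[None, :, :]       # (n, k, p)
    return np.vstack([Phi, G.reshape(n * k, p)])  # ((k + 1) n, p)

def sobolev_targets(X, beta, V):
    """Noise-free single-index target sin(beta . x / sqrt(d)) and its sketched gradient."""
    d = X.shape[1]
    t = X @ beta / np.sqrt(d)
    y = np.sin(t)
    grad_proj = np.cos(t)[:, None] * (V.T @ beta)[None, :] / np.sqrt(d)  # (n, k)
    return np.concatenate([y, grad_proj.ravel()])

def train_mse(n, d, p, k, seed=0):
    """Minimum-norm least-squares fit; returns the mean squared training residual."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    W = rng.standard_normal((p, d))
    beta = rng.standard_normal(d)
    V, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal sketch
    A = sobolev_design(X, W, V)
    b = sobolev_targets(X, beta, V)
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(np.mean((A @ a - b) ** 2))
```

Because the stacked design matrix has $(k+1)n$ rows, comparing `train_mse(20, 10, 30, 2)` with `train_mse(20, 10, 90, 2)` gives a quick numerical look at the two sides of the interpolation threshold in this toy setting. Generalization error, observational noise, and regularization, which are the quantities the paper actually analyzes, are deliberately left out of the sketch.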
