Learning Multi-Index Models with Hyper-Kernel Ridge Regression
Shuo Huang, Hippolyte Labarrière, Ernesto De Vito, Tomaso Poggio, Lorenzo Rosasco
TL;DR
This work asks why deep networks succeed in high-dimensional settings, focusing on compositional structure as captured by multi-index models (MIMs). It introduces hyper-kernel ridge regression (HKRR), which jointly learns a linear representation $B$ and a nonlinear predictor through a family of kernels $k_B(x,x')=k(Bx,Bx')$, and positions HKRR as a bridge between kernel methods and neural representations. The authors prove that HKRR achieves excess-risk rates that depend on the latent dimension $d_*$ rather than the ambient dimension $D$, show that Nyström-based compression preserves these rates, provide adaptive procedures for selecting $d$ and $\lambda$, and establish optimization guarantees for two nonconvex solvers (VarPro and AGD) under analytic kernels. Numerical experiments corroborate the theory, illustrating the nonconvexity of the problem, the impact of initialization, and the practical benefit of overparameterizing the latent dimension. Overall, HKRR offers a principled, scalable framework for learning compositional representations that mitigate the curse of dimensionality while preserving kernel-based interpretability.
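To make the setting concrete, the display below sketches one standard way to write a multi-index model and the corresponding HKRR objective. Only $B$, $k_B$, $d$, $d_*$, $D$, and $\lambda$ come from the summary above. The symbols $g_*$, $B_*$, $\mathcal{H}_B$ (the reproducing kernel Hilbert space of $k_B$), the sample $(x_i, y_i)_{i=1}^n$, and the $1/n$ normalization are notation assumed here for illustration, and any constraint the paper places on $B$ is omitted.

$$
f_*(x) = g_*(B_* x), \qquad B_* \in \mathbb{R}^{d_* \times D}, \quad g_* : \mathbb{R}^{d_*} \to \mathbb{R},
$$

$$
\min_{B \in \mathbb{R}^{d \times D}} \; \min_{f \in \mathcal{H}_B} \; \frac{1}{n} \sum_{i=1}^{n} \bigl(f(x_i) - y_i\bigr)^2 + \lambda \, \|f\|_{\mathcal{H}_B}^2 .
$$

For a fixed $B$, the inner problem is ordinary kernel ridge regression with kernel $k_B$. The nonconvexity comes from the outer minimization over $B$.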
Abstract
Deep neural networks excel in high-dimensional problems, outperforming models such as kernel methods, which suffer from the curse of dimensionality. However, the theoretical foundations of this success remain poorly understood. We follow the idea that the compositional structure of the learning task is the key factor determining when deep networks outperform other approaches. Taking a step towards formalizing this idea, we consider a simple compositional model, namely the multi-index model (MIM). In this context, we introduce and study hyper-kernel ridge regression (HKRR), an approach blending neural networks and kernel methods. Our main contribution is a sample complexity result demonstrating that HKRR can adaptively learn MIM, overcoming the curse of dimensionality. Further, we exploit the kernel nature of the estimator to develop ad hoc optimization approaches. Indeed, we contrast alternating minimization and alternating gradient methods both theoretically and numerically. These numerical results complement and reinforce our theoretical findings.
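The kernel structure that the alternating schemes exploit can be illustrated with a short NumPy sketch. It is a minimal illustration under stated assumptions, not the authors' implementation. The Gaussian base kernel is one choice of analytic kernel. The function names (`gaussian_kernel`, `hyper_kernel`, `fit_alpha`, `reduced_objective`), the bandwidth `sigma`, and the `n * lam` scaling of the regularizer (which matches a $1/n$-normalized squared loss) are all hypothetical choices made for this example.

```python
import numpy as np


def gaussian_kernel(U, V, sigma=1.0):
    """Gaussian kernel matrix between the rows of U and the rows of V."""
    sq_dists = (U**2).sum(1)[:, None] + (V**2).sum(1)[None, :] - 2.0 * U @ V.T
    # Clamp tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))


def hyper_kernel(B, X, Z, sigma=1.0):
    """k_B(x, x') = k(Bx, Bx'): the base kernel applied to projected inputs."""
    return gaussian_kernel(X @ B.T, Z @ B.T, sigma)


def fit_alpha(B, X, y, lam, sigma=1.0):
    """For fixed B, solve kernel ridge regression in closed form.

    Assumed objective: (1/n) * ||K a - y||^2 + lam * a^T K a,
    whose minimizer satisfies (K + n * lam * I) a = y.
    """
    n = X.shape[0]
    K = hyper_kernel(B, X, X, sigma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)


def reduced_objective(B, X, y, lam, sigma=1.0):
    """Value of the assumed objective after eliminating the coefficients.

    This is the function of B alone that a variable-projection style
    method would minimize; it is nonconvex in B in general.
    """
    n = X.shape[0]
    K = hyper_kernel(B, X, X, sigma)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    residual = K @ alpha - y
    return residual @ residual / n + lam * alpha @ K @ alpha
```

Here `X` is an `(n, D)` data matrix and `B` a `(d, D)` projection, so the kernel only ever sees `d`-dimensional inputs. An alternating gradient scheme would instead update `alpha` and `B` with separate gradient steps on the same objective rather than solving for `alpha` exactly at each iteration.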
