Learning Multi-Index Models with Hyper-Kernel Ridge Regression

Shuo Huang, Hippolyte Labarrière, Ernesto De Vito, Tomaso Poggio, Lorenzo Rosasco

TL;DR

This work tackles the question of how to theoretically understand deep networks' success in high-dimensional settings by focusing on compositional structure through multi-index models (MIMs). It introduces Hyper-Kernel Ridge Regression (HKRR), which jointly learns a linear representation $B$ and a nonlinear predictor via a family of kernels $k_B(x,x')=k(Bx,Bx')$, and studies HKRR as a bridge between kernel methods and neural representations. The authors prove that HKRR achieves excess-risk rates that depend on the latent dimension $d_*$ rather than the ambient dimension $D$, show Nyström-based compression without loss of rates, provide adaptive procedures to select $d$ and $\lambda$, and establish optimization guarantees for two nonconvex solvers (VarPro and AGD) under analytic kernels. Numerical experiments corroborate the theory, demonstrating nonconvexity, the impact of initialization, and the practical benefits of overparameterizing the latent dimension. Overall, HKRR offers a principled, scalable framework to learn compositional representations that mitigate the curse of dimensionality while preserving kernel-based interpretability.
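To make the kernel family $k_B(x,x')=k(Bx,Bx')$ concrete, the following is a minimal sketch of the predictor step for a fixed representation $B$. It assumes a Gaussian base kernel, a standard kernel ridge objective $\frac{1}{m}\|K\alpha-y\|^2+\lambda\,\alpha^\top K\alpha$, and synthetic data; the function names, the bandwidth `gamma`, and the target function are illustrative choices, not taken from the paper.

```python
import numpy as np

def hyper_kernel(B, X1, X2, gamma=1.0):
    """k_B(x, x') = k(Bx, Bx') with an assumed Gaussian base kernel k."""
    Z1, Z2 = X1 @ B.T, X2 @ B.T  # project inputs from dimension D to d
    sq = (Z1**2).sum(1)[:, None] + (Z2**2).sum(1)[None, :] - 2.0 * Z1 @ Z2.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

def krr_fit(B, X, y, lam):
    """Closed-form ridge coefficients alpha for a fixed B."""
    m = X.shape[0]
    K = hyper_kernel(B, X, X)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

def krr_predict(B, alpha, X_train, X_new):
    """Evaluate the fitted predictor at new points."""
    return hyper_kernel(B, X_new, X_train) @ alpha

# Synthetic multi-index data: y depends on x only through B_true @ x.
rng = np.random.default_rng(0)
D, d, m = 20, 2, 200
B_true = rng.standard_normal((d, D)) / np.sqrt(D)
X = rng.standard_normal((m, D))
y = np.sin(X @ B_true[0]) + (X @ B_true[1]) ** 2

alpha = krr_fit(B_true, X, y, lam=1e-3)
train_mse = np.mean((krr_predict(B_true, alpha, X, X) - y) ** 2)
print(f"training MSE with the true B: {train_mse:.4f}")
```

In HKRR, $B$ is learned jointly with the predictor rather than fixed; this sketch only shows the inner solve that becomes available once $B$ is given.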

Abstract

Deep neural networks excel in high-dimensional problems, outperforming models such as kernel methods, which suffer from the curse of dimensionality. However, the theoretical foundations of this success remain poorly understood. We follow the idea that the compositional structure of the learning task is the key factor determining when deep networks outperform other approaches. Taking a step towards formalizing this idea, we consider a simple compositional model, namely the multi-index model (MIM). In this context, we introduce and study hyper-kernel ridge regression (HKRR), an approach blending neural networks and kernel methods. Our main contribution is a sample complexity result demonstrating that HKRR can adaptively learn MIM, overcoming the curse of dimensionality. Further, we exploit the kernel nature of the estimator to develop ad hoc optimization approaches. Indeed, we contrast alternating minimization and alternating gradient methods both theoretically and numerically. These numerical results complement and reinforce our theoretical findings.

Paper Structure

This paper contains 40 sections, 17 theorems, 150 equations, 6 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

Suppose Assumption ass:0 holds. Let $0<\delta<2/e$, $\zeta < r/(d_*+r)$, and $\lambda = \lambda_m = m^{-\zeta}$. Then, with probability at least $1-\delta$, there holds for all $m\geq m_\delta$, where $m_\delta$ is independent of $D$ and $d_*$, and $C_1$ is a constant independent of $D$, $d_*$, and $\delta$.
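The regularization schedule in the theorem can be sketched as a small helper. This is only an illustration: `r` is treated as a given positive number (its meaning comes from the assumption the theorem refers to), and the `margin` parameter is a hypothetical device for picking some $\zeta$ strictly below $r/(d_*+r)$.

```python
def lambda_schedule(m, r, d_star, margin=0.9):
    """Return (zeta, lambda_m) with zeta = margin * r / (d_star + r) and lambda_m = m**(-zeta).

    margin must lie strictly between 0 and 1 so that zeta < r / (d_star + r).
    """
    if not 0.0 < margin < 1.0:
        raise ValueError("margin must lie strictly between 0 and 1")
    zeta = margin * r / (d_star + r)
    return zeta, m ** (-zeta)

zeta, lam = lambda_schedule(m=10_000, r=2.0, d_star=3)
print(zeta, lam)
```

Note that the schedule involves the latent dimension $d_*$ but not the ambient dimension $D$.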

Figures (6)

  • Figure 1: Comparison between VarPro (red) and AGD (blue). The panels show, in order: training losses across time for two random initializations of $B^0$; a two-dimensional toy example with initialization $(-1.5,-1.5)$, where AGD escapes a local minimum in which VarPro remains stuck; and initialization $(-1.5,-0.1)$, where both methods converge to minima, with VarPro being significantly faster. See the appendix for additional details.
  • Figure 2: $R^2$ score on test sets for $B$ and $\alpha$ learned by VarPro (red) and AGD (blue). Top: performance w.r.t. the parameter $d$ for Dataset 1 (left) and Dataset 2 (right), with true latent dimension $d_*=3$ and ambient dimension $D=50$. Bottom: performance for $d\in\{3,20,50\}$ as the training size increases for Dataset 1. See the appendix for further details.
  • Figure 3: Left and center: Trajectories of the iterates of VarPro (in red) and AGD (in blue) for $f:x,y\mapsto\left(x-y^2\right)^2+\cos(\pi y)+(1-y)^2+1$ (taking $(x_0,y_0)=(-1.5,-1.5)$). Right: Value of the loss function w.r.t. the number of iterations.
  • Figure 4: Convergence map of VarPro and AGD for minimizing $f:x,y\mapsto\left(x-y^2\right)^2+\cos(\pi y)+(1-y)^2+1$. Purple = both methods converge to the global minimum from the corresponding initialization point; Red = only VarPro converges to the global minimum; Blue = only AGD converges to the global minimum; Gray = no method converges to the global minimum. The white star is the global minimizer of $f$ and the yellow '+' crosses are local minimizers.
  • Figure 5: Left and center: Trajectories of the iterates of VarPro (in red) and AGD (in blue) for $f:x,y\mapsto\left(x-\sigma(y)\right)^2+\cos(\pi y)+(1-y)^2+1$ (taking $(x_0,y_0)=(5,-1.75)$). Right: Value of the loss function w.r.t. the number of iterations.
  • ...and 1 more figure
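The toy function from Figures 3 and 4 can be used to sketch how the two solvers differ. The version below is a rough illustration under stated assumptions: $x$ is treated as the variable that is eliminated in closed form (the function is quadratic in $x$, so the minimizer is $x=y^2$), and the step size and iteration count are arbitrary choices. It is not tuned to reproduce the figures.

```python
import math

def f(x, y):
    """Toy objective from the figure captions."""
    return (x - y**2) ** 2 + math.cos(math.pi * y) + (1 - y) ** 2 + 1

def grad_x(x, y):
    return 2 * (x - y**2)

def grad_y(x, y):
    return -4 * y * (x - y**2) - math.pi * math.sin(math.pi * y) - 2 * (1 - y)

def varpro(x, y, step=0.01, iters=2000):
    """Variable-projection style: minimize over x exactly, then take a gradient step in y."""
    for _ in range(iters):
        x = y**2                  # exact minimizer of f(., y)
        y -= step * grad_y(x, y)  # gradient of the reduced objective in y
    return y**2, y

def agd(x, y, step=0.01, iters=2000):
    """Alternating gradient descent: one gradient step in x, then one in y."""
    for _ in range(iters):
        x -= step * grad_x(x, y)
        y -= step * grad_y(x, y)
    return x, y

start = (-1.5, -1.5)  # initialization used in the Figure 3 caption
xv, yv = varpro(*start)
xa, ya = agd(*start)
print(f"start  f = {f(*start):.4f}")
print(f"VarPro (x, y) = ({xv:.4f}, {yv:.4f}), f = {f(xv, yv):.4f}")
print(f"AGD    (x, y) = ({xa:.4f}, {ya:.4f}), f = {f(xa, ya):.4f}")
```

Because VarPro only ever moves along the curve $x=y^2$, its iterates follow the one-dimensional reduced objective, while AGD moves in the full plane; this is one way to read the difference in trajectories that the figures report.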

Theorems & Definitions (41)

  • Remark 1: Neural and RBF networks
  • Theorem 1
  • Remark 2
  • Remark 3: Beating the curse of dimensionality
  • Remark 4: Suboptimal rate
  • Theorem 2
  • Remark 5
  • Remark 6
  • Theorem 3
  • Theorem 4: Convergence of AGD and VarPro (informal)
  • ...and 31 more