Dimension-free deterministic equivalents and scaling laws for random feature regression

Leonardo Defilippis, Bruno Loureiro, Theodor Misiakiewicz

TL;DR

Under a certain concentration property, the test error of random feature ridge regression (RFRR) is shown to be well approximated by a general deterministic equivalent: a closed-form expression that depends only on the feature map eigenvalues.

Abstract

In this work we investigate the generalization performance of random feature ridge regression (RFRR). Our main contribution is a general deterministic equivalent for the test error of RFRR. Specifically, under a certain concentration property, we show that the test error is well approximated by a closed-form expression that only depends on the feature map eigenvalues. Notably, our approximation guarantee is non-asymptotic, multiplicative, and independent of the feature map dimension -- allowing for infinite-dimensional features. We expect this deterministic equivalent to hold broadly beyond our theoretical analysis, and we empirically validate its predictions on various real and synthetic datasets. As an application, we derive sharp excess error rates under standard power-law assumptions of the spectrum and target decay. In particular, we provide a tight result for the smallest number of features achieving optimal minimax error rate.
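
For orientation, the RFRR estimator studied in the abstract can be written as below. This is a sketch under assumed notation: the $1/\sqrt{p}$ normalization, the symbols $\hat f$, ${\boldsymbol a}$, ${\boldsymbol Z}$, and the unnormalized squared loss are conventions chosen here for illustration, and the paper's own definition may differ by constant factors.

```latex
\begin{align*}
\hat f({\boldsymbol x}; {\boldsymbol a})
  &= \frac{1}{\sqrt{p}} \sum_{j=1}^{p} a_j \, \varphi({\boldsymbol x}; {\boldsymbol w}_j),
  \qquad {\boldsymbol w}_j \sim_{\text{i.i.d.}} \mu_w, \\
\hat{\boldsymbol a}_\lambda
  &= \arg\min_{{\boldsymbol a} \in \mathbb{R}^p}
     \Big\{ \sum_{i=1}^{n} \big( y_i - \hat f({\boldsymbol x}_i; {\boldsymbol a}) \big)^2
            + \lambda \| {\boldsymbol a} \|_2^2 \Big\}
   = \big( {\boldsymbol Z}^\top {\boldsymbol Z} + \lambda {\boldsymbol I}_p \big)^{-1}
     {\boldsymbol Z}^\top {\boldsymbol y},
  \qquad Z_{ij} = \frac{\varphi({\boldsymbol x}_i; {\boldsymbol w}_j)}{\sqrt{p}}.
\end{align*}
```

Here $n$ is the number of samples, $p$ the number of random features, and $\lambda > 0$ the ridge regularization. The closed form in the second line is the standard ridge solution for this objective.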

Paper Structure (36 sections, 25 theorems, 350 equations, 9 figures, 2 tables, 1 algorithm)

This paper contains 36 sections, 25 theorems, 350 equations, 9 figures, 2 tables, 1 algorithm.

Key Result

Theorem 3.3

Under Assumptions \ref{ass:concentration_eigenfunctions} and \ref{ass:technical}, and for any $D,K >0$, there exist constants $\eta_* \in (0,1/2)$ and $C_{*,D,K}>0$ such that the following holds. For any $n , p \geq C_{*,D,K}$, regularization $\lambda >0$, and target function $f_{\star} \in L_{2}(\mu_x)$, if [condition missing from this extract], then with probability at least $1 - n^{-D} - p^{-D}$, we have [bound missing from this extract], where $\mathsf R_{n,p}({\boldsymbol \beta}$ [statement truncated]

Figures (9)

  • Figure 1: Excess risk (\ref{eq:def:risk}) of RFRR as a function of the number of features $p$ for a fixed number of samples $n$. Solid lines are obtained from the deterministic equivalent in \ref{thm:main_test_error_RFRR}, and points are numerical simulations; the different curves correspond to different regularization strengths $\lambda\geq 0$. (Left) Training data $({\boldsymbol x}_{i},y_{i})_{i\in[n]}$, $n = 500$, sampled from a teacher-student model $y_{i}=\operatorname{erf}(\langle {\boldsymbol \beta}, {\boldsymbol x}_i\rangle) +\varepsilon_i$, $\sigma_\varepsilon^2=0.1$, ${\boldsymbol x}_{i}\sim_{\text{i.i.d.}}\mathcal{N}(0,{\boldsymbol I}_{d})$, with a spiked random feature map $\varphi({\boldsymbol x}, {\boldsymbol w}) = \operatorname{tanh}(\langle {\boldsymbol w} + u{\boldsymbol v},{\boldsymbol x}\rangle)$, where ${\boldsymbol v}\in\mathbb{R}^{d}$ has a fixed overlap $\gamma = \langle {\boldsymbol v}, {\boldsymbol \beta}\rangle$ with the teacher vector, ${\boldsymbol w}\sim\mathcal{N}(0,d^{-1}{\boldsymbol I}_d)$, and $u\sim\mathcal{N}(0,1)$. (Right) Training data $({\boldsymbol x}_{i}, y_{i})_{i\in[n]}$, $n = 300$, sub-sampled from the FashionMNIST data set \cite{xiao2017fashionmnist}, with feature map given by $\varphi({\boldsymbol x};{\boldsymbol w}) = \operatorname{erf}(\langle {\boldsymbol w}, {\boldsymbol x}\rangle)$ and $\mu_w=\mathcal{N}(0,d^{-1}{\boldsymbol I}_{d})$.
  • Figure 2: Excess error rate $\gamma$ in the regime $n \gg \sigma_\varepsilon ^{-1/(\gamma_{\mathcal{B}}(\ell, q) - \gamma_{{\mathcal{V}}}(\ell, q))}$ as a function of $(\ell, q)$, defined in \ref{eq:rates:risk} and \ref{def:feat_reg_powerlaw}, for $r\geq 1/2$ (Left) and $r\in [0,1/2)$ (Right). The explicit crossover points $\ell_{\star}, q_{\star}, \hat{q}$ are defined in \ref{eq:rates:crossover} as a function of the source $r$ and capacity $\alpha$ exponents.
  • Figure 3: Excess risk (\ref{eq:def:risk}) of RFRR as a function of the number of samples $n$ under source and capacity conditions (\ref{eq:def:sourcecapacity}) and power-law assumptions $\lambda = n^{-(\ell-1)}$, $p=n^{q}$, with noise variance $\sigma_\varepsilon^2 = 0.1$. Solid lines are obtained from the deterministic equivalent in \ref{thm:main_test_error_RFRR}. In the left panel, points are finite-size numerical experiments. Dashed and dotted lines are the analytical rates from Theorem \ref{thm:rates:risk}, stated in the legend. The colour scheme corresponds to the regions of Fig. \ref{fig:rates_rlarge}.
  • Figure 4: Relative difference between the excess risk (\ref{eq:def:risk}) of random features ridge regression from numerical simulation and its deterministic equivalent (\ref{thm:main_test_error_RFRR}), with regularization strength $\lambda = 0.1$ and noise variance $\sigma_\varepsilon^2 = 0.1$. The relative error is $O((n\wedge p)^{-1/2})$, in agreement with \ref{eq:approximation_rate_test}. The simulations follow the procedure described in \ref{app:numerics_gaussian_design}, with $\xi_k = k^{-1.2}$ and $\beta_{*,k} = k^{-1.46}$. (Left) $p = 3000$ fixed. (Right) $n = 3000$ fixed.
  • Figure 5: Excess risk (\ref{eq:def:risk}) of random features ridge regression. Solid lines are obtained from the deterministic equivalent in \ref{thm:main_test_error_RFRR}, and points are numerical simulations; the different curves correspond to different regularization strengths $\lambda\geq 0$. Training data $({\boldsymbol x}_{i},y_{i})_{i\in[n]}$ are sampled from a teacher-student model $y_{i}=\operatorname{tanh}(\langle {\boldsymbol \beta}, {\boldsymbol x}_i\rangle) +\varepsilon_i$, $\sigma_\varepsilon^2=0.1$, with random feature map $\varphi({\boldsymbol x}, {\boldsymbol w}) = \operatorname{ReLU}(\langle {\boldsymbol w},{\boldsymbol x}\rangle)$. The covariates $\{{\boldsymbol x}_i\}$ and weights $\{{\boldsymbol w}_i\}$ are sampled uniformly from the $d$-dimensional spheres of radius $\sqrt{d}$ and $1$, respectively. (Left) Excess risk as a function of $n$, with $p = 600$ fixed. (Right) Excess risk as a function of $p$, with $n = 500$ fixed.
  • ...and 4 more figures
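
The numerical side of the Figure 5 setup (tanh teacher, ReLU random features, data and weights on spheres) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `sample_sphere` and `rfrr_excess_risk`, the $1/\sqrt{p}$ feature normalization, the unit-norm teacher vector, the dimension `d`, the test-set size, and the Monte Carlo estimate of the excess risk are all assumptions made here.

```python
import numpy as np


def sample_sphere(rng, m, d, radius):
    """Draw m points uniformly from the sphere of the given radius in R^d."""
    g = rng.standard_normal((m, d))
    return radius * g / np.linalg.norm(g, axis=1, keepdims=True)


def rfrr_excess_risk(n, p, d, lam, noise_var=0.1, n_test=2000, seed=0):
    """Monte Carlo estimate of the excess risk of RFRR on one random draw."""
    rng = np.random.default_rng(seed)
    beta = sample_sphere(rng, 1, d, 1.0)[0]       # teacher vector (assumed unit norm)
    W = sample_sphere(rng, p, d, 1.0)             # random feature weights, radius 1
    X = sample_sphere(rng, n, d, np.sqrt(d))      # training covariates, radius sqrt(d)
    X_test = sample_sphere(rng, n_test, d, np.sqrt(d))

    def target(A):
        return np.tanh(A @ beta)                  # noiseless teacher f_*(x)

    def features(A):
        # ReLU random features with an assumed 1/sqrt(p) normalization
        return np.maximum(A @ W.T, 0.0) / np.sqrt(p)

    y = target(X) + np.sqrt(noise_var) * rng.standard_normal(n)
    Z = features(X)
    # Ridge solution: (Z^T Z + lam I)^{-1} Z^T y
    a_hat = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)
    pred = features(X_test) @ a_hat
    # Excess risk: squared distance to the noiseless target on fresh inputs
    return float(np.mean((pred - target(X_test)) ** 2))


if __name__ == "__main__":
    for p in (50, 100, 200):
        print(p, rfrr_excess_risk(n=400, p=p, d=20, lam=0.1))
```

Sweeping `p` at fixed `n` (or `n` at fixed `p`) and averaging over several seeds would give the kind of simulation points that the caption compares against the deterministic equivalent.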

Theorems & Definitions (46)

  • Definition 1: Deterministic equivalents
  • Theorem 3.3: Test error of RFRR
  • Corollary 3.4: Kernel limit
  • Corollary 3.5: Approximation limit
  • Theorem 4.1: Excess risk rates
  • Remark 4.1
  • Corollary 4.2: Optimal rates
  • Definition 2: Effective regularization
  • Definition 3: Intrinsic dimension
  • Theorem A.2: Dimension-free deterministic equivalents
  • ...and 36 more
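
The statement of Definition 2 (effective regularization) is not reproduced on this page. As a hedged illustration, the sketch below solves the fixed-point equation that is standard in the ridge regression literature, $n - \lambda/\nu = \operatorname{Tr}\big(\Sigma(\Sigma+\nu)^{-1}\big)$, for a diagonal $\Sigma$ with the power-law spectrum $\xi_k = k^{-1.2}$ quoted in the Figure 4 caption. The equation itself, the symbol $\nu$, the function name `effective_regularization`, and the truncation to finitely many eigenvalues are assumptions made here; the paper's definition may differ.

```python
import numpy as np


def effective_regularization(xi, n, lam, iters=200):
    """Solve n - lam/nu = sum_k xi_k / (xi_k + nu) for nu > 0 (assumed definition).

    The left side minus the right side is increasing in nu, so the root is
    unique and can be bracketed and found by bisection on a log scale.
    """
    xi = np.asarray(xi, dtype=float)

    def gap(nu):
        return n - lam / nu - np.sum(xi / (xi + nu))

    lo = lam / n                        # gap(lo) = -sum(...) < 0
    hi = 2.0 * (lam + xi.sum()) / n     # gap(hi) >= n/2 > 0
    for _ in range(iters):
        mid = np.sqrt(lo * hi)          # geometric midpoint
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return float(np.sqrt(lo * hi))


if __name__ == "__main__":
    xi = np.arange(1, 10001, dtype=float) ** -1.2   # truncated power-law spectrum
    for n in (100, 1000):
        print(n, effective_regularization(xi, n=n, lam=0.1))
```

Bisection is used instead of a fixed-point iteration because the monotonicity of the gap function guarantees convergence without tuning.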