Dimension-free deterministic equivalents and scaling laws for random feature regression

Leonardo Defilippis; Bruno Loureiro; Theodor Misiakiewicz

TL;DR

The paper derives a general deterministic equivalent for the test error of random feature ridge regression (RFRR): under a certain concentration property, the test error is well approximated by a closed-form expression that depends only on the feature map eigenvalues.

Abstract

In this work we investigate the generalization performance of random feature ridge regression (RFRR). Our main contribution is a general deterministic equivalent for the test error of RFRR. Specifically, under a certain concentration property, we show that the test error is well approximated by a closed-form expression that only depends on the feature map eigenvalues. Notably, our approximation guarantee is non-asymptotic, multiplicative, and independent of the feature map dimension -- allowing for infinite-dimensional features. We expect this deterministic equivalent to hold broadly beyond our theoretical analysis, and we empirically validate its predictions on various real and synthetic datasets. As an application, we derive sharp excess error rates under standard power-law assumptions of the spectrum and target decay. In particular, we provide a tight result for the smallest number of features achieving optimal minimax error rate.
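
As a concrete illustration of the estimator discussed in the abstract, here is a minimal NumPy sketch of random feature ridge regression. It is not the authors' code. It assumes a tanh feature map, Gaussian covariates and weights, a tanh teacher, and a 1/sqrt(p) normalisation of the features; the function names and the default sizes (n, p, d) are hypothetical choices made for this example only.

```python
import numpy as np

def random_features(X, W):
    """Feature matrix Z with Z[i, j] = tanh(<w_j, x_i>) / sqrt(p).

    The tanh map and the 1/sqrt(p) scaling are assumptions of this sketch.
    """
    p = W.shape[0]
    return np.tanh(X @ W.T) / np.sqrt(p)

def fit_rfrr(X, y, W, lam):
    """Ridge solution a_hat = (Z^T Z + lam * I_p)^{-1} Z^T y."""
    Z = random_features(X, W)
    p = W.shape[0]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)

def predict_rfrr(X, W, a_hat):
    """Evaluate the fitted random feature model on new inputs."""
    return random_features(X, W) @ a_hat

def excess_risk_demo(n=500, p=300, d=30, lam=0.1, noise_var=0.1,
                     n_test=2000, seed=0):
    """Monte Carlo estimate of the excess risk on fresh test points."""
    rng = np.random.default_rng(seed)
    # Teacher direction, normalised to unit length (illustrative choice).
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)
    # Random weights w_j ~ N(0, I_d / d), drawn once and then frozen.
    W = rng.standard_normal((p, d)) / np.sqrt(d)
    # Training data: noisy labels from the teacher.
    X = rng.standard_normal((n, d))
    y = np.tanh(X @ beta) + np.sqrt(noise_var) * rng.standard_normal(n)
    a_hat = fit_rfrr(X, y, W, lam)
    # Excess risk: squared error against the noiseless target function.
    X_test = rng.standard_normal((n_test, d))
    f_test = np.tanh(X_test @ beta)
    return float(np.mean((predict_rfrr(X_test, W, a_hat) - f_test) ** 2))
```

Calling `excess_risk_demo()` for a range of `p` at fixed `n` gives the kind of simulated points that a deterministic equivalent would be compared against; the closed-form prediction itself is not reproduced here.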

Paper Structure (36 sections, 25 theorems, 350 equations, 9 figures, 2 tables, 1 algorithm)

Introduction
Further related works
Setting
Deterministic equivalents
Main result
Particular limits
Scaling laws
Context
Results
Relationship to scaling laws
Background on deterministic equivalents
Proof of the deterministic equivalent for RFRR
Preliminaries
Fixed points, feature covariance matrix, and tail rank
Deterministic equivalents for functionals of $\bold{Z}$ conditional on $\bold{F}$
...and 21 more sections

Key Result

Theorem 3.3

Under Assumptions \ref{ass:concentration_eigenfunctions} and \ref{ass:technical}, and for any $D,K >0$, there exist constants $\eta_* \in (0,1/2)$ and $C_{*,D,K}>0$ such that the following holds. For any $n , p \geq C_{*,D,K}$, regularization $\lambda >0$, and target function $f_{\star} \in L_{2}(\mu_x)$, if [...], then with probability at least $1 - n^{-D} - p^{-D}$, we have [...], where $\mathsf R_{n,p}({\boldsymbol \beta}

Figures (9)

Figure 1: Excess risk (\ref{eq:def:risk}) of RFRR as a function of the number of features $p$ for a fixed number of samples $n$. Solid lines are obtained from the deterministic equivalent in Theorem \ref{thm:main_test_error_RFRR}, and points are numerical simulations, with the different curves denoting different regularization strengths $\lambda\geq 0$. (Left) Training data $({\boldsymbol x}_{i},y_{i})_{i\in[n]}$, $n = 500$, sampled from a teacher-student model $y_{i}=\operatorname{erf}(\langle {\boldsymbol \beta}, {\boldsymbol x}_i\rangle) +\varepsilon_i$, $\sigma_\varepsilon^2=0.1$, ${\boldsymbol x}_{i}\sim_{\text{i.i.d.}}\mathcal{N}(0,{\boldsymbol I}_{d})$, with a spiked random feature map $\varphi({\boldsymbol x}, {\boldsymbol w}) = \operatorname{tanh}(\langle {\boldsymbol w} + u{\boldsymbol v},{\boldsymbol x}\rangle)$ where ${\boldsymbol v}\in\mathbb{R}^{d}$ has a fixed overlap $\gamma = \langle {\boldsymbol v}, {\boldsymbol \beta}\rangle$ with the teacher vector, ${\boldsymbol w}\sim\mathcal{N}(0,d^{-1}{\boldsymbol I}_d)$, $u\sim\mathcal{N}(0,1)$. (Right) Training data $({\boldsymbol x}_{i}, y_{i})_{i\in[n]}$, $n = 300$, sub-sampled from the FashionMNIST data set \cite{xiao2017fashionmnist}, with feature map given by $\varphi({\boldsymbol x};{\boldsymbol w}) = \operatorname{erf}(\langle {\boldsymbol w}, {\boldsymbol x}\rangle)$ and $\mu_w=\mathcal{N}(0,d^{-1}{\boldsymbol I}_{d})$.
Figure 2: Excess error rate $\gamma$ in the regime $n \gg \sigma_\varepsilon ^{-1/(\gamma_{\mathcal{B}}(\ell, q) - \gamma_{{\mathcal{V}}}(\ell, q))}$ as a function of $(\ell, q)$, defined in (\ref{eq:rates:risk}) and \ref{def:feat_reg_powerlaw}, for $r\geq 1/2$ (Left) and $r\in [0,1/2)$ (Right). The explicit crossover points $\ell_{\star}, q_{\star}, \hat{q}$ are defined in (\ref{eq:rates:crossover}) as a function of the source $r$ and capacity $\alpha$ exponents.
Figure 3: Excess risk (\ref{eq:def:risk}) of RFRR as a function of the number of samples $n$ under source and capacity conditions (\ref{eq:def:sourcecapacity}) and power-law assumptions $\lambda = n^{-(\ell-1)}$, $p=n^{q}$, with noise variance $\sigma_\varepsilon^2 = 0.1$. Solid lines are obtained from the deterministic equivalent in Theorem \ref{thm:main_test_error_RFRR}. In the left panel, points are finite-size numerical experiments. Dashed and dotted lines are the analytical rates from Theorem \ref{thm:rates:risk}, stated in the legend. The colour scheme corresponds to the regions of Fig. \ref{fig:rates_rlarge}.
Figure 4: Relative difference between the excess risk (\ref{eq:def:risk}) of random features ridge regression from numerical simulation and its deterministic equivalent (Theorem \ref{thm:main_test_error_RFRR}), with regularization strength $\lambda = 0.1$ and noise variance $\sigma_\varepsilon^2 = 0.1$. The relative error is $O((n\wedge p)^{-1/2})$, in agreement with (\ref{eq:approximation_rate_test}). The simulations follow the procedure described in Appendix \ref{app:numerics_gaussian_design}, with $\xi_k = k^{-1.2}$ and $\beta_{*,k} = k^{-1.46}$. (Left) $p = 3000$ fixed. (Right) $n = 3000$ fixed.
Figure 5: Excess risk (\ref{eq:def:risk}) of random features ridge regression. Solid lines are obtained from the deterministic equivalent in Theorem \ref{thm:main_test_error_RFRR}, and points are numerical simulations, with the different curves denoting different regularization strengths $\lambda\geq 0$. Training data $({\boldsymbol x}_{i},y_{i})_{i\in[n]}$ are sampled from a teacher-student model $y_{i}=\operatorname{tanh}(\langle {\boldsymbol \beta}, {\boldsymbol x}_i\rangle) +\varepsilon_i$, $\sigma_\varepsilon^2=0.1$, with random feature map $\varphi({\boldsymbol x}, {\boldsymbol w}) = \operatorname{ReLU}(\langle {\boldsymbol w},{\boldsymbol x}\rangle)$. Both covariates $\{{\boldsymbol x}_i\}$ and weights $\{{\boldsymbol w}_i\}$ are sampled uniformly from the $d$-dimensional spheres of radius $\sqrt{d}$ and $1$, respectively. (Left) Excess risk as a function of $n$, with $p = 600$ fixed. (Right) Excess risk as a function of $p$, with $n = 500$ fixed.
...and 4 more figures
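
The power-law parameterization quoted in the caption of Figure 3, $\lambda = n^{-(\ell-1)}$ and $p = n^{q}$, can be written as a small helper, and a decay exponent can be estimated from simulated risks with a log-log least-squares fit. The sketch below is illustrative only: the function names are hypothetical, rounding $p$ to the nearest integer is a choice made here, and the fit assumes the excess risk behaves like $C\,n^{-\gamma}$ over the range of $n$ supplied.

```python
import numpy as np

def powerlaw_schedule(n, ell, q):
    """Return (lam, p) with lam = n^{-(ell - 1)} and p = round(n^q).

    Rounding p to an integer (and flooring it at 1) is an assumption
    of this sketch, since the number of features must be a whole number.
    """
    lam = float(n) ** (-(ell - 1.0))
    p = max(1, int(round(float(n) ** q)))
    return lam, p

def fit_decay_rate(ns, risks):
    """Estimate gamma in risk ~ C * n^{-gamma} by least squares in log-log scale."""
    slope, _intercept = np.polyfit(np.log(ns), np.log(risks), 1)
    return -slope
```

In practice one would call `powerlaw_schedule` for each sample size, simulate or evaluate the excess risk at the resulting `(lam, p)`, and pass the collected values to `fit_decay_rate` to compare the empirical slope with an analytical rate.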

Theorems & Definitions (46)

Definition 1: Deterministic equivalents
Theorem 3.3: Test error of RFRR
Corollary 3.4: Kernel limit
Corollary 3.5: Approximation limit
Theorem 4.1: Excess risk rates
Remark 4.1
Corollary 4.2: Optimal rates
Definition 2: Effective regularization
Definition 3: Intrinsic dimension
Theorem A.2: Dimension-free deterministic equivalents
...and 36 more
