No Free Lunch From Random Feature Ensembles: Scaling Laws and Near-Optimality Conditions

Benjamin S. Ruben, William L. Tong, Hamza Tahir Chaudhry, Cengiz Pehlevan

TL;DR

This paper analyzes how to allocate a fixed parameter budget between a single large random-feature ridge regression model and an ensemble of smaller models. Using deterministic equivalents and kernel-eigenstructure analysis, it proves a no-free-lunch theorem: with optimal ridge, no ensemble can beat a single large model, though ensembles can achieve near-optimal performance under certain spectral conditions. In the overparameterized regime, the leading error depends on the total feature budget $M=KN$, making ensembles effectively comparable to a single model at leading order; in the underparameterized regime, explicit scaling laws are derived via a growth exponent $\ell$, linking ensemble size and per-model size to the kernel and task structure. These results provide principled guidance for resource allocation in kernel/RFRR contexts and connect to broader scaling laws observed in deep learning, highlighting when feature learning or kernel alignment can yield near-optimal ensemble performance.

Abstract

Given a fixed budget for total model size, one must choose between training a single large model or combining the predictions of multiple smaller models. We investigate this trade-off for ensembles of random-feature ridge regression models in both the overparameterized and underparameterized regimes. Using deterministic equivalent risk estimates, we prove that when a fixed number of parameters is distributed among $K$ independently trained models, the ridge-optimized test risk increases with $K$. Consequently, a single large model achieves optimal performance. We then ask when ensembles can achieve near-optimal performance. In the overparameterized regime, we show that, to leading order, the test error depends on ensemble size and model size only through the total feature count, so that overparameterized ensembles consistently achieve near-optimal performance. To understand underparameterized ensembles, we derive scaling laws for the test risk as a function of total parameter count when the ensemble size and parameters per ensemble member are jointly scaled according to a "growth exponent" $\ell$. While the optimal error scaling is always achieved by increasing model size with a fixed ensemble size, our analysis identifies conditions on the kernel and task eigenstructure under which near-optimal scaling laws can be obtained by joint scaling of ensemble size and model size.
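The fixed-budget comparison described in the abstract can be sketched numerically. The snippet below is a minimal illustration, not the authors' experimental setup: the toy `tanh` target, the Gaussian inputs, the ridge normalization, the ridge grid, and all function names are assumptions made for this example. It splits a total feature budget $M = KN$ among $K$ ensemble members that share a training set but draw independent ReLU random features, averages their predictions, and reports the best test risk over the ridge grid for each $K$.

```python
import numpy as np


def relu_features(X, W):
    """ReLU random features, scaled by 1/sqrt(N) (an assumed normalization)."""
    return np.maximum(X @ W.T, 0.0) / np.sqrt(W.shape[0])


def ensemble_test_risk(X_tr, y_tr, X_te, y_te, K, N, lam, rng):
    """Mean squared test error of K averaged RF ridge regressors with N features each."""
    D = X_tr.shape[1]
    pred = np.zeros(X_te.shape[0])
    for _ in range(K):
        W = rng.standard_normal((N, D))  # independent feature draw per member
        F_tr = relu_features(X_tr, W)
        F_te = relu_features(X_te, W)
        # Ridge solution: w = (F^T F + lam * I)^{-1} F^T y
        w = np.linalg.solve(F_tr.T @ F_tr + lam * np.eye(N), F_tr.T @ y_tr)
        pred += (F_te @ w) / K  # ensemble prediction is the member average
    return float(np.mean((pred - y_te) ** 2))


def sweep_fixed_budget(M=256, Ks=(1, 2, 4, 8), lams=(1e-3, 1e-2, 1e-1, 1.0),
                       P=200, D=20, n_test=500, seed=0):
    """For each K dividing M, return the test risk minimized over the ridge grid."""
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(D) / np.sqrt(D)
    X_tr = rng.standard_normal((P, D))
    X_te = rng.standard_normal((n_test, D))
    # Hypothetical toy target, chosen only for illustration.
    y_tr, y_te = np.tanh(X_tr @ beta), np.tanh(X_te @ beta)
    results = {}
    for K in Ks:
        N = M // K  # features per member; K is assumed to divide M
        results[K] = min(
            ensemble_test_risk(X_tr, y_tr, X_te, y_te, K, N, lam, rng)
            for lam in lams
        )
    return results


if __name__ == "__main__":
    for K, risk in sweep_fixed_budget().items():
        print(f"K={K:2d}  ridge-optimized test risk = {risk:.4f}")
```

A single run is noisy, and the ridge is tuned only over a coarse grid, so the ordering in $K$ predicted by the theory should be expected to emerge only after averaging over many seeds and using a finer grid.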

Paper Structure

This paper contains 37 sections, 3 theorems, 62 equations, 12 figures.

Key Result

Theorem 4.1

(More is better for RF Ensembles) Let $E_g^K(P, N, \lambda)$ denote $E_g^K$ with $P$ training samples, $N$ random features per ensemble member, ensemble size $K$, and ridge parameter $\lambda$, for any task eigenstructure $\{\eta_t\}_{t=1}^\infty$, $\{\bar{w}_t\}_{t=1}^\infty$, where $\{\eta_t\}_{t=1}^\infty$ […]. Let $K' \geq K$, $N' \geq N$, and $P' \geq P$. Then $\min_{\lambda} E_g^{K'}(P', N', \lambda) \leq \min_{\lambda} E_g^{K}(P, N, \lambda)$, with strict inequality as long as $(K', N', P') \neq (K, N, P)$ and $\sum_t \bar{w}_t^2 \eta_t>0$.

Figures (12)

  • Figure 1: "More is better" in random feature ensembles. We perform $\mathrm{ReLU}$ RFRR on a binarized CIFAR-10 classification task and compare the empirical test risk to the omniscient risk estimate (eq. \ref{EgK}). (A) We fix $N=256$ and vary both $P$ and $K$. Color corresponds to the regularization $\lambda$. Markers show numerical experiments and dotted lines show theoretical predictions. Error is monotonically decreasing with $P$ provided that the regularization $\lambda$ is tuned to its optimal value. (B) Same as (A) except that $P=256$ is fixed and $K$, $N$ are varied. Markers and error bars show mean and standard deviation over $50$ trials.
  • Figure 2: No Free Lunch from Random Feature Ensembles. We perform kernel RF regression on a binarized CIFAR-10 classification task. (A) We vary $K$ and $N$ while keeping the total parameter count $M = 1024$ fixed. The sample size $P$ is indicated above each plot. (B) Error $E_g^K$ optimized over the ridge parameter $\lambda$ increases monotonically with $K$ provided the total parameter count $M$ is fixed. Dashed lines show the theoretical prediction using eq. \ref{EgK}, and markers and error bars show the mean and standard deviation of the risk measured in numerical simulations across 10 trials. (C) We show error as a function of $\lambda$ for each $K$ value simulated and $P = 8192$. Dashed lines show the theoretical prediction using eq. \ref{EgK}, and shaded regions show the standard deviation of the risk measured in numerical simulations across 10 trials.
  • Figure 3: Graphical depiction of how ensemble size $K$ and model size $N$ scale with total feature count $M$ under different growth exponents $\ell$ (see eq. \ref{JointScaling}).
  • Figure 4: Width-bottlenecked scaling laws of kernel RF regression under source and capacity constraints. We fix $P = 15{,}000$, $\alpha = 1.5$, and $r \in \{0.4, 0.8, 1.2\}$ and calculate $E_g^K$ as a function of $M$ with $N = M^\ell$ and $K = M^{1-\ell}$ using both the omniscient risk estimate (eq. \ref{EgK}) and numerical simulation of a linear Gaussian random-feature model (eq. \ref{LinearRFModel}). (A) Plots of $E_g^K$ vs. $M$ at different $\ell$ values reveal that $\ell$ controls the scaling law of the error. (B) We plot the theoretical scaling exponents (eq. \ref{ScalingLaws}): $\operatorname{Bias} \sim 2 \alpha \ell \min (r, 1)$, $\operatorname{Var} \sim 1-\ell + 2 \alpha \ell \min(r, \frac{1}{2})$, along with the scaling laws obtained by fitting the risks obtained by numerical simulation.
  • Figure 5: Scaling laws provide a good description of width-bottlenecked RFRR ensembles. (A) We plot error as a function of $M$ at the optimal ridge value for $\mathrm{ReLU}$ random-feature models applied to the binarized CIFAR-10 (left) and MNIST (right) classification tasks. (B) We plot theoretically predicted scaling exponents (eq. \ref{ScalingLaws}) for the bias and variance contributions to risk, as well as empirical power-law fits to risk in numerical simulations of RFRR models (see Appendix \ref{TheoreticalPredictionsMethods}, fig. \ref{fig:spectra_cifar_mnist}).
  • ...and 7 more figures
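
The joint scaling and exponent formulas quoted in the captions of Figures 3 and 4 can be evaluated directly. The sketch below assumes that the quoted expressions are decay exponents, that is, $\operatorname{Bias} \propto M^{-2\alpha\ell\min(r,1)}$ and $\operatorname{Var} \propto M^{-(1-\ell+2\alpha\ell\min(r,1/2))}$; the function names and the rounding of $K$ and $N$ to integers are choices made for this example, not taken from the paper.

```python
import math


def joint_scaling(M, ell):
    """Split a total feature budget M as N = M**ell and K = M**(1 - ell).

    Both values are rounded to integers of at least 1, so K * N only
    approximates M when the powers are not exact integers.
    """
    N = max(1, round(M ** ell))
    K = max(1, round(M ** (1.0 - ell)))
    return K, N


def scaling_exponents(alpha, r, ell):
    """Exponents quoted in the Figure 4 caption, read as decay rates in M."""
    bias = 2.0 * alpha * ell * min(r, 1.0)
    var = (1.0 - ell) + 2.0 * alpha * ell * min(r, 0.5)
    return bias, var


if __name__ == "__main__":
    alpha = 1.5  # value used in Figure 4
    for r in (0.4, 0.8, 1.2):
        for ell in (0.5, 0.75, 1.0):
            K, N = joint_scaling(1024, ell)
            bias, var = scaling_exponents(alpha, r, ell)
            # A sum of two power laws is dominated at large M by the
            # slower-decaying term, i.e. the smaller exponent.
            print(f"r={r} ell={ell} K={K} N={N} "
                  f"bias={bias:.3f} var={var:.3f} slower={min(bias, var):.3f}")
```

Setting $\ell = 1$ recovers a single model ($K = 1$, $N = M$), while smaller $\ell$ shifts the budget toward more, smaller ensemble members.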

Theorems & Definitions (9)

  • Theorem 4.1
  • Remark 4.2
  • Remark 4.3
  • Theorem 5.1
  • Remark 5.2
  • Remark 5.3
  • Remark 5.4
  • Corollary 5.5
  • Remark 5.6