No Free Lunch From Random Feature Ensembles: Scaling Laws and Near-Optimality Conditions
Benjamin S. Ruben, William L. Tong, Hamza Tahir Chaudhry, Cengiz Pehlevan
TL;DR
This paper analyzes how to allocate a fixed parameter budget between a single large random-feature ridge regression (RFRR) model and an ensemble of smaller ones. Using deterministic equivalents and an analysis of the kernel eigenstructure, it proves a no-free-lunch theorem: when the ridge is optimally tuned, no ensemble beats a single large model, although ensembles can achieve near-optimal performance under certain spectral conditions. In the overparameterized regime, the leading-order error depends only on the total feature budget $M=KN$, so ensembles match a single model at leading order. In the underparameterized regime, explicit scaling laws are derived in terms of a growth exponent $\ell$ that controls how ensemble size and per-model size grow together, with the resulting rates determined by the kernel and task structure. These results give principled guidance for resource allocation in kernel and RFRR settings and connect to the scaling laws observed in deep learning, indicating when feature learning or kernel alignment can yield near-optimal ensemble performance.
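One way to make the joint scaling concrete is to split the budget $M = KN$ by a power law in $M$. The form below is an illustrative assumption consistent with the description above, not a statement of the paper's exact definition:

$$
K \propto M^{1-\ell}, \qquad N \propto M^{\ell}, \qquad \ell \in [0, 1].
$$

Under this assumed parameterization, $\ell = 1$ corresponds to growing the model size $N$ at fixed ensemble size $K$, and $\ell = 0$ corresponds to adding more members of fixed size.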
Abstract
Given a fixed budget for total model size, one must choose between training a single large model and combining the predictions of multiple smaller models. We investigate this trade-off for ensembles of random-feature ridge regression models in both the overparameterized and underparameterized regimes. Using deterministic equivalent risk estimates, we prove that when a fixed number of parameters is distributed among $K$ independently trained models, the ridge-optimized test risk increases with $K$. Consequently, a single large model achieves optimal performance. We then ask when ensembles can achieve near-optimal performance. In the overparameterized regime, we show that, to leading order, the test error depends on ensemble size and model size only through the total feature count, so that overparameterized ensembles consistently achieve near-optimal performance. To understand underparameterized ensembles, we derive scaling laws for the test risk as a function of total parameter count when the ensemble size and parameters per ensemble member are jointly scaled according to a "growth exponent" $\ell$. While the optimal error scaling is always achieved by increasing model size with a fixed ensemble size, our analysis identifies conditions on the kernel and task eigenstructure under which near-optimal scaling laws can be obtained by joint scaling of ensemble size and model size.
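The fixed-budget comparison described above can be sketched numerically. The following is a minimal illustration rather than the authors' experimental setup: the ReLU feature map, the synthetic `tanh` target, the ridge grid, and the names `relu_features`, `fit_ensemble`, `predict_ensemble`, and `demo` are all assumptions introduced here. Each of the $K$ members draws its own $N = M/K$ random features, is fit by ridge regression on the same training set, and the member predictions are averaged.

```python
import numpy as np

def relu_features(X, W):
    # Random ReLU feature map (an assumed choice): phi(x) = relu(W x) / sqrt(N)
    return np.maximum(X @ W.T, 0.0) / np.sqrt(W.shape[0])

def fit_ensemble(X, y, K, N, ridge, rng):
    """Fit K ridge regressors, each on its own N independent random features."""
    members = []
    for _ in range(K):
        W = rng.standard_normal((N, X.shape[1])) / np.sqrt(X.shape[1])
        Phi = relu_features(X, W)
        # Ridge solution: a = (Phi^T Phi + ridge * I)^{-1} Phi^T y
        a = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(N), Phi.T @ y)
        members.append((W, a))
    return members

def predict_ensemble(members, X):
    # Ensemble prediction is the average of the member predictions
    return np.mean([relu_features(X, W) @ a for W, a in members], axis=0)

def demo(M=256, P=200, D=10, seed=0, Ks=(1, 4, 16)):
    """Return {K: best test MSE over a ridge grid} at fixed budget M = K * N."""
    rng = np.random.default_rng(seed)
    X, X_test = rng.standard_normal((P, D)), rng.standard_normal((2000, D))
    beta = rng.standard_normal(D) / np.sqrt(D)
    y, y_test = np.tanh(X @ beta), np.tanh(X_test @ beta)  # synthetic target (assumption)
    results = {}
    for K in Ks:
        N = M // K  # features per ensemble member
        errs = []
        for ridge in np.logspace(-4, 1, 6):
            # Re-seed so every ridge value sees the same random features
            feat_rng = np.random.default_rng(seed + K)
            members = fit_ensemble(X, y, K, N, ridge, feat_rng)
            errs.append(np.mean((predict_ensemble(members, X_test) - y_test) ** 2))
        # Oracle ridge tuning on the test set, for illustration only
        results[K] = float(min(errs))
    return results

if __name__ == "__main__":
    for K, err in demo().items():
        print(f"K={K:2d}  ridge-optimized test MSE={err:.4f}")
```

Because this is a single finite-sample draw with a coarse ridge grid, the printed errors need not increase monotonically with $K$; the monotonicity result stated above concerns the deterministic equivalent risk at the optimal ridge.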
