Learning Curves for Noisy Heterogeneous Feature-Subsampled Ridge Ensembles

Benjamin S. Ruben, Cengiz Pehlevan

TL;DR

This work develops a theory of feature-subsampled ridge ensembles under noisy, correlated data by applying the replica trick to derive analytical learning curves. It shows that subsampling can shift the double-descent peak for a linear predictor and introduces heterogeneous connectivity as a practical, scalable mitigation that remains effective in image-feature contexts. The authors further characterize an ensembling–subsampling trade-off under resource constraints, revealing phase-like regimes that determine optimal ensemble size and regularization. The results provide actionable guidance for designing robust, feature-subsampled ensembles in noisy, high-dimensional settings and demonstrate qualitative relevance to deep-feature based classification tasks.

Abstract

Feature bagging is a well-established ensembling method which aims to reduce prediction variance by combining predictions of many estimators trained on subsets or projections of features. Here, we develop a theory of feature-bagging in noisy least-squares ridge ensembles and simplify the resulting learning curves in the special case of equicorrelated data. Using analytical learning curves, we demonstrate that subsampling shifts the double-descent peak of a linear predictor. This leads us to introduce heterogeneous feature ensembling, with estimators built on varying numbers of feature dimensions, as a computationally efficient method to mitigate double-descent. Then, we compare the performance of a feature-subsampling ensemble to a single linear predictor, describing a trade-off between noise amplification due to subsampling and noise reduction due to ensembling. Our qualitative insights carry over to linear classifiers applied to image classification tasks with realistic datasets constructed using a state-of-the-art deep learning feature map.
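
As a rough illustration of the setup described in the abstract, the following minimal sketch trains $k$ ridge readouts on mutually exclusive feature subsets and averages their predictions. It is not the authors' code: the function names (`fit_ridge`, `fit_ensemble`, `predict_ensemble`), the isotropic Gaussian data, the label-noise level, and the unnormalized ridge penalty are assumptions made for illustration, and the feature noise and readout noise studied in the paper are omitted.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Ridge weights (X^T X + lam I)^{-1} X^T y; pseudoinverse rule when lam == 0."""
    if lam == 0:
        return np.linalg.pinv(X) @ y
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def fit_ensemble(X, y, subsets, lam):
    """Train one ridge readout per feature subset (each subset is an index array)."""
    return [fit_ridge(X[:, idx], y, lam) for idx in subsets]

def predict_ensemble(X, subsets, weights):
    """Average the predictions of all readouts."""
    preds = [X[:, idx] @ w for idx, w in zip(subsets, weights)]
    return np.mean(preds, axis=0)

# Hypothetical toy task: isotropic Gaussian features, noisy linear labels.
rng = np.random.default_rng(0)
M, P, k = 200, 100, 4                      # features, training samples, readouts
w_star = rng.normal(size=M) / np.sqrt(M)   # ground-truth weights (assumed isotropic)
X_train = rng.normal(size=(P, M))
y_train = X_train @ w_star + 0.1 * rng.normal(size=P)
X_test = rng.normal(size=(1000, M))
y_test = X_test @ w_star

# Mutually exclusive subsets, each holding a fraction 1/k of the features.
subsets = np.array_split(rng.permutation(M), k)
weights = fit_ensemble(X_train, y_train, subsets, lam=1e-2)
test_error = np.mean((predict_ensemble(X_test, subsets, weights) - y_test) ** 2)
print(f"ensemble test MSE: {test_error:.4f}")
```

Setting `k = 1` recovers a single linear predictor on all features, which gives a baseline for the trade-off between subsampling and ensembling discussed in the abstract.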

Paper Structure (39 sections, 2 theorems, 145 equations, 18 figures, 1 table)


Key Result

Proposition 1

Consider the ensembled ridge regression problem described in Section Setup. Consider the asymptotic limit where $M, P, \{N_r\} \to \infty$ while the ratios $\alpha = \frac{P}{M}$ and $\nu_{rr} = \frac{N_r}{M}$, $r = 1,\dots,k$ remain fixed. Define the following quantities: Then the terms of the average generalization error (eq. SingleErrorTerm) may be written as: where the pairs of order paramet

Figures (18)

  • Figure 1: Comparison of numerical and theoretical learning curves for ensembled linear regression. Circles represent numerical results averaged over 100 trials; lines indicate theoretical predictions. Error bars represent the standard error of the mean but are often smaller than the markers. (a) Testing of proposition \ref{Proposition1} with $M = 2000$, $\left[\bm{\Sigma}_s \right]_{ij} = 0.8^{|i-j|}$, $\left[\bm{\Sigma}_0 \right]_{ij} = \frac{1}{10}(0.3)^{|i-j|}$, $\zeta = 0.1$, and all $\eta_r = 0.2$ and $\lambda_r = \lambda$ (see legend). $k=3$ linear predictors access fixed, randomly selected (with replacement) subsets of the features with fractional sizes $\nu_{rr} = 0.2, 0.4, 0.6$. Fixed ground-truth weights $\bm{w}^*$ are drawn from an isotropic Gaussian distribution. (b) Testing of proposition \ref{EquiCorrProp} with $M = 5000$, $s = 1$, $c = 0.6$, $\omega^2 = 0.1$, $\zeta = 0.1$, all $\eta_r = 0.1$, and all $\lambda_r = \lambda$ (see legend). Ground-truth weights are sampled as in eq. \ref{alignedGT} with $\rho = 0.3$. Feature subsets accessed by each readout are mutually exclusive (inset) with fractional sizes $\nu_{rr} = 0.1, 0.3, 0.5$.
  • Figure 2: Subsampling alters the location of the double-descent peak of a linear predictor. (a) Illustrations of subsampled linear predictors with varying subsampling fraction $\nu$. (b) Comparison between experiment and theory for subsampling linear regression on equicorrelated datasets. We choose task parameters as in proposition \ref{EquiCorrProp} with $c=\omega = \zeta = \eta = 0$, $s=1$, and (i) $\lambda = 0$, (ii) $\lambda = 10^{-4}$, (iii) $\lambda = 10^{-2}$. All learning curves are for a single linear predictor ($k=1$) with the subsampling fraction $\nu$ shown in the legend. Circles show the results of numerical experiments. Lines show the analytical predictions.
  • Figure 3: Heterogeneous ensembling mitigates double-descent. (a) We compare (i) homogeneous ensembling, in which $k$ readouts connect to the same fraction $\nu = 1/k$ of features, and (ii) heterogeneous ensembling. (b) In heterogeneous ensembling, subsampling fractions are drawn i.i.d. from $\Gamma_{k,\sigma}(\nu)$, shown here for $k=10$, then re-scaled to sum to 1. (c) Generalization error curves for homogeneous and heterogeneous ensembling with $k = 10$, $\zeta = 0$, $\rho = 0.3$ and indicated values of $\lambda$, $c$, and $\eta$. Blue: homogeneous subsampling. Red, green, and purple show heterogeneous subsampling with $\sigma = 0.25/k, 0.5/k, 1/k$ respectively. Dashed lines show learning curves for 3 particular realizations of $\{\nu_{11}, \dots, \nu_{kk}\}$. Solid curves show the average over 100 realizations. Gray shows the learning curve for a single linear readout with $\nu = 1$ and optimal regularization (eq. \ref{optreg_local}). Triangular marks show the asymptotic generalization error ($\alpha \to \infty$), with downward-pointing gray triangles indicating an asymptotic error of zero. (d,e) Generalization error of linear classifiers applied to the imagewoof dataset with ResNext features, averaged over 100 trials. (d) $P=100$, $k = 1$, with varying subsampling fraction $\nu$ and regularization $\lambda$ (legend). (e) Generalization error of (i) homogeneous and (ii) heterogeneous (with $\sigma = 0.75/k$) ensembles of classifiers. Legend indicates $k$ values. $\lambda = 0$ except for gray curves, where $\lambda = 0.1$.
  • Figure 4: Task parameters dictate the ensembling-subsampling trade-off. (a-d) In the setting of proposition \ref{EquiCorrProp} in the special case where all $\nu_{rr'} = \frac{1}{k} \delta_{rr'}$ so that feature subsets are mutually exclusive and the total number of weights is conserved. (a) We plot the reduced generalization errors $\mathcal{E}$ (for $\lambda = 0$, using eq. \ref{HomEnsErr_Ridgeless}) and $\mathcal{E}^*$ (for $\lambda = \lambda^*$, using eq. \ref{HomEnsErr_LocOpt}) of linear ridge ensembles of varying size $k$ with $\rho = 0$ and $H = 0,1$ (values indicated above plots). Grey lines indicate $k=1$, dashed black lines $k \to \infty$, and intermediate $k$ values by the colorbar. (b) We plot optimal ensemble size $k^*$ (eqs. \ref{kstar_ridgeless}, \ref{kstar_opt}) in the parameter space of sample size $\alpha$ and reduced readout noise scale $H$, setting $W=Z=0$. Grey indicates $k^* = 1$ and white indicates $k^* = \infty$, with intermediate values given by the colorbar. Appended vertical bars show $\alpha \to \infty$. Dotted black lines show the analytical boundary between the intermediate and noise-dominated phases given by eq. \ref{AnalyticalBoundaryZeroReg}. (c) Optimal readout $k^*$ phase diagrams as in (b) but showing $W$-dependence with $H = Z = 0$. (d) Optimal readout $k^*$ phase diagrams as in (b) but showing $Z$-dependence with $H = W = 0$. (e) Learning curves for feature-subsampling ensembles of linear classifiers combined using a majority vote rule on the imagewoof classification task (see Appendix \ref{ResNextAppendix}). As in (a-d) we set $\nu_{rr'} = \frac{1}{k} \delta_{rr'}$. Error is calculated as the probability of incorrectly classifying a test example. $\lambda$ and $\eta$ values are indicated in each panel. (f) Numerical phase diagrams showing the value of $k$ which minimizes test error in the parameter space of sample size $P$ and readout noise scale $\eta$, with regularization (i) $\lambda = 0$ (pseudoinverse rule) and (ii) $\lambda = 0.1$.
  • Figure S1: In numerical experiments, we train linear classifiers to predict labels of ImageNet images based on their last-hidden-layer representations in a pre-trained ResNext deep learning architecture \cite{xie2017}. Here, we show the structure of the datasets constructed using the ResNext feature map for the Imagenette task (left), which consists of categorizing images from 10 unrelated categories, and the Imagewoof task (right), which consists of categorizing images from 10 different dog breeds. (a) Gram matrix of the centered ResNext features defined as $\frac{1}{P} \left( \bm{\Phi}-\bar{\bm{\Phi}} \right)^\top \left( \bm{\Phi}-\bar{\bm{\Phi}} \right)$ for data matrix $\bm{\Phi} \in \mathbb{R}^{P \times M}$, where $P$ is the total size of the dataset. The dataset is sorted by label and tick marks show the boundaries between classes. (b) The covariance eigenspectrum of the ResNext features is well described by a power-law decay. (c) Generalization error of linear classification with a single linear predictor with access to a fraction $\nu = N/M$ of the ResNext features, averaged over 100 trials (see discussion in section \ref{ResNextSubsampSection}).
  • ...and 13 more figures
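
The heterogeneous scheme in the Figure 3(b) caption (fractions drawn i.i.d. from $\Gamma_{k,\sigma}(\nu)$, then re-scaled to sum to 1) can be sketched as below. The caption does not spell out the parameterization, so this sketch assumes a Gamma distribution with mean $1/k$ and standard deviation $\sigma$. The helper names `sample_fractions` and `fractions_to_subsets` are hypothetical, and assigning mutually exclusive subsets by flooring $\nu_r M$ is an illustrative choice rather than the authors' procedure.

```python
import numpy as np

def sample_fractions(k, sigma, rng):
    """Draw k subsampling fractions from a Gamma distribution, rescaled to sum to 1.

    Assumption: the Gamma distribution has mean 1/k and standard deviation sigma.
    """
    mean = 1.0 / k
    shape = (mean / sigma) ** 2      # shape * scale = mean
    scale = sigma ** 2 / mean        # shape * scale**2 = sigma**2
    nu = rng.gamma(shape, scale, size=k)
    return nu / nu.sum()

def fractions_to_subsets(nu, M, rng):
    """Assign mutually exclusive feature subsets with sizes floor(nu_r * M)."""
    sizes = np.floor(nu * M).astype(int)
    bounds = np.cumsum(sizes)
    perm = rng.permutation(M)
    # Features left over after flooring are simply not used by any readout.
    return np.split(perm[:bounds[-1]], bounds[:-1])

rng = np.random.default_rng(0)
k, M = 10, 2000
nu = sample_fractions(k, sigma=0.5 / k, rng=rng)
subsets = fractions_to_subsets(nu, M, rng)
print(np.round(nu, 3), [len(s) for s in subsets])
```

For a ridgeless readout that sees $N_r = \nu_r M$ features, the interpolation threshold sits near $P = N_r$, so spreading the $\nu_r$ values places the individual peaks at different sample sizes. This is consistent with the qualitative picture in the Figure 2 and Figure 3 captions.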

Theorems & Definitions (8)

  • Proposition 1
  • proof
  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Proposition 2
  • proof