Subsampled Ensemble Can Improve Generalization Tail Exponentially

Huajie Qian; Donghao Ying; Henry Lam; Wotao Yin

TL;DR

The paper tackles heavy-tailed generalization error in data-driven learning and optimization by introducing MoVE and ROvE, ensemble methods that select the most frequently produced model (the mode) or an $\epsilon$-optimal model from base learners trained on random subsamples. Voting across these subsampled models turns polynomially decaying excess-risk tails into exponentially decaying ones, with finite-sample bounds on the tail probability $\mathbb{P}\left(L(\hat{\theta}_n)>\min_{\theta}L(\theta)+\delta\right)$. The contributions are the MoVE framework for discrete decision spaces and the two-phase ROvE/ROvEs procedures for continuous spaces, each backed by theoretical bounds and numerical experiments on neural networks, decision trees, and stochastic programs under heavy-tailed noise. The result is a practical, theoretically grounded method that improves out-of-sample performance in heavy-tailed regimes across machine learning and optimization tasks.
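To give a feel for why the shape of the tail matters, the following schematic comparison may help. The constants $C$, $c$, $\alpha$ and the target level $\gamma$ are hypothetical placeholders, not quantities from the paper, whose actual bounds also depend on the subsample size and other problem-specific terms. Solving each tail bound for the sample size $n$ needed to push the tail probability below $\gamma$:

$$
C\,n^{-\alpha}\le\gamma \iff n\ge\left(\frac{C}{\gamma}\right)^{1/\alpha},
\qquad
C\,e^{-c\,n}\le\gamma \iff n\ge\frac{1}{c}\log\frac{C}{\gamma}.
$$

With the illustrative values $C=1$, $\alpha=1$, $c=0.1$ and $\gamma=10^{-6}$, the polynomial tail requires $n\ge 10^{6}$, while the exponential tail requires only $n\ge 10\ln(10^{6})\approx 138.2$.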

Abstract

Ensemble learning is a popular technique to improve the accuracy of machine learning models. It traditionally hinges on the rationale that aggregating multiple weak models can lead to better models with lower variance and hence higher stability, especially for discontinuous base learners. In this paper, we provide a new perspective on ensembling. By selecting the most frequently generated model from the base learner when repeatedly applied to subsamples, we can attain exponentially decaying tails for the excess risk, even if the base learner suffers from slow (i.e., polynomial) decay rates. This tail enhancement power of ensembling applies to base learners that have reasonable predictive power to begin with and is stronger than variance reduction in the sense of exhibiting rate improvement. We demonstrate how our ensemble methods can substantially improve out-of-sample performances in a range of numerical examples involving heavy-tailed data or intrinsically slow rates.
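The selection rule described above (run the base learner on many subsamples and keep the model it produces most often) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `move_majority_vote` and `pick_cheaper_arm`, the parameters `k`, `B` and `seed`, subsampling without replacement, and the made-up toy data are all assumptions introduced here.

```python
import random
from collections import Counter
from statistics import mean


def move_majority_vote(data, base_learner, k, B, seed=0):
    """Return the model most often produced by base_learner on B subsamples.

    Assumptions of this sketch: subsamples of size k are drawn without
    replacement, and models are hashable so that votes can be counted.
    """
    rng = random.Random(seed)
    pool = list(data)
    votes = Counter()
    for _ in range(B):
        subsample = rng.sample(pool, k)       # k distinct data points
        votes[base_learner(subsample)] += 1   # one vote per trained model
    model, _ = votes.most_common(1)[0]        # the mode of the votes
    return model, votes


def pick_cheaper_arm(sample):
    """Toy base learner on a two-point decision space {"a", "b"}.

    Each data point is a pair (cost_a, cost_b); pick the arm with the
    lower sample-average cost.
    """
    avg_a = mean(cost_a for cost_a, _ in sample)
    avg_b = mean(cost_b for _, cost_b in sample)
    return "a" if avg_a <= avg_b else "b"


# Made-up toy data: arm "a" is usually cheaper, but one extreme
# observation inflates its full-sample average.
data = [(1.0, 2.0)] * 19 + [(100.0, 2.0)]

full_sample_choice = pick_cheaper_arm(data)
ensemble_choice, votes = move_majority_vote(data, pick_cheaper_arm, k=5, B=200)
print("full sample:", full_sample_choice, "| ensemble:", ensemble_choice, dict(votes))
```

In this toy data the single outlier flips the full-sample choice, whereas most size-5 subsamples miss the outlier and vote for the other arm. It illustrates only the voting mechanism and makes no claim about the rates proved in the paper.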

Paper Structure (34 sections, 14 theorems, 135 equations, 21 figures, 3 algorithms)

Introduction
Methodology and Theoretical Guarantees
A Basic Procedure
A More General Procedure
Numerical Experiments
Neural Networks and Trees for Regression
Stochastic Programs
Related Work
Conclusion and Limitations
Implications of Theorem \ref{thm: general majority vote} for Strong Base Learners
Proofs for Main Theoretical Results
Preliminaries
Proof of Theorem \ref{thm: general majority vote}
Proof of Corollary \ref{cor: application of move to linear program example}
Proof of Theorem \ref{thm: finite-sample bound for multiple predictions two phase splitting}
...and 19 more sections

Key Result

Theorem 2.1

Consider a discrete decision space $\Theta$. Let $\Theta^{\delta}:=\left\{\theta\in\Theta:L(\theta)\leq \min_{\theta'\in\Theta}L(\theta')+\delta\right\}$ be the set of $\delta$-optimal models. Here $p_k(\theta)$ is defined in (prob:maximizing selection probability), and $\max_{\theta\in\Theta\setminus\Theta^{\delta}}p_k(\theta)$ evaluates to $0$ if $\Theta\setminus\Theta^{\delta}$ is empty. Th
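The set $\Theta^{\delta}$ and the empty-set convention in the statement above can be made concrete with a small sketch. The function names `delta_optimal_set` and `max_prob_outside`, and the toy numbers, are hypothetical. Here $p_k$ is simply taken as a given mapping from models to probabilities, since its definition sits in the referenced equation and is not reproduced in this summary.

```python
def delta_optimal_set(losses, delta):
    """Models whose loss is within delta of the minimum loss.

    losses: dict mapping each model in a finite decision space to L(theta).
    """
    best = min(losses.values())
    return {theta for theta, loss in losses.items() if loss <= best + delta}


def max_prob_outside(p_k, theta_delta):
    """Maximum of p_k over models outside theta_delta.

    Evaluates to 0 when every model is delta-optimal, matching the
    convention stated in the theorem.
    """
    outside = [p for theta, p in p_k.items() if theta not in theta_delta]
    return max(outside, default=0.0)


# Hypothetical three-model decision space.
losses = {"a": 1.0, "b": 1.05, "c": 2.0}
p_k = {"a": 0.5, "b": 0.3, "c": 0.2}

near_optimal = delta_optimal_set(losses, delta=0.1)
print(near_optimal, max_prob_outside(p_k, near_optimal))
```

With a large enough `delta` every model is $\delta$-optimal, the complement is empty, and `max_prob_outside` returns `0.0`.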

Figures (21)

Figure 1: Results of neural networks. (a)(b)(d)(e): Expected out-of-sample costs (MSE) with $95\%$ confidence intervals under different noise distributions and varying numbers of hidden layers ($H$). (c) and (f): Tail probabilities of out-of-sample costs.
Figure 2: Comparison with bagging in terms of expected out-of-sample costs (MSE) with $95\%$ confidence intervals (a-c) or tail probabilities (d-f) under varying degrees of tail heaviness. Hyperparameters: $k_1 = \max(30, n/2), k_2 = \max(30, n/1000), B_1 = 50, B_2 = 1000$.
Figure 3: Results of decision trees in terms of tail probabilities of out-of-sample costs (MSE). Hyperparameters: $k_1=\max(30,n/10),k_2=\max(30,n/200),B_1=50,B_2=200$.
Figure 4: Results of neural networks with $4$ hidden layers on three real datasets, in terms of tail probabilities of out-of-sample costs (MSE).
Figure 5: Results for stochastic programs. (a)-(e): Expected out-of-sample costs with $95\%$ confidence intervals. (f): Running time comparison in the network design problem.
...and 16 more figures

Theorems & Definitions (23)

Example 1.1: LP with a polynomial tail
Theorem 2.1: Informal bound for Algorithm \ref{bagging majority vote: set estimator}
Corollary 2.2: Enhanced tail for Example \ref{ex: linear_program}
Theorem 2.3: Informal bound for Algorithm \ref{bagging majority vote: two phase}
Definition B.1
Lemma B.2: MGF dominance of U-statistics from Hoeffding (1963)
Lemma B.3: Concentration bound for U-statistics with binary kernels
Lemma B.4
Definition B.5
Definition B.6
...and 13 more
