Error Reduction from Stacked Regressions

Xin Chen, Jason M. Klusowski, Yan Shuo Tan

TL;DR

This paper shows that, thanks to an adaptive shrinkage effect, a stacked estimator built from nested least-squares regressions has strictly smaller population risk than the best single estimator among them, with larger gains when the signal-to-noise ratio is small.

Abstract

Stacking regressions is an ensemble technique that forms linear combinations of different regression estimators to enhance predictive accuracy. The conventional approach uses cross-validation data to generate predictions from the constituent estimators, and least-squares with nonnegativity constraints to learn the combination weights. In this paper, we learn these weights analogously by minimizing a regularized version of the empirical risk subject to a nonnegativity constraint. When the constituent estimators are linear least-squares projections onto nested subspaces separated by at least three dimensions, we show that, thanks to an adaptive shrinkage effect, the resulting stacked estimator has strictly smaller population risk than the best single estimator among them, with more significant gains when the signal-to-noise ratio is small. Here "best" refers to an estimator that minimizes a model selection criterion such as AIC or BIC. In other words, in this setting, the best single estimator is inadmissible. Because the optimization problem can be reformulated as isotonic regression, the stacked estimator requires the same order of computation as the best single estimator, making it an attractive alternative in terms of both performance and implementation.
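
To make the conventional approach described in the abstract concrete, the following is a minimal sketch, not the paper's regularized procedure. It assumes the nested subspaces are spanned by the first `d` columns of a design matrix, uses K-fold cross-validation to produce held-out predictions, and learns nonnegative weights with `scipy.optimize.nnls`. The function names (`cv_predictions`, `stack_weights`, `stacked_predict`) and the fold count are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.optimize import nnls


def cv_predictions(X, y, dims, n_folds=5, seed=0):
    """Held-out predictions from nested least-squares models.

    Model k regresses y on the first dims[k] columns of X, so the fitted
    subspaces are nested (an assumption made for this sketch).
    Returns an (n, len(dims)) matrix of out-of-fold predictions.
    """
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    Z = np.zeros((n, len(dims)))
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        for k, d in enumerate(dims):
            # Least-squares fit on the training folds only.
            coef, *_ = np.linalg.lstsq(
                X[train_mask, :d], y[train_mask], rcond=None
            )
            Z[test_idx, k] = X[test_idx, :d] @ coef
    return Z


def stack_weights(Z, y):
    """Nonnegative least-squares combination weights for the columns of Z."""
    weights, _residual_norm = nnls(Z, y)
    return weights


def stacked_predict(X_train, y_train, X_new, dims, weights):
    """Refit every nested model on all training data, then combine."""
    preds = np.zeros((X_new.shape[0], len(dims)))
    for k, d in enumerate(dims):
        coef, *_ = np.linalg.lstsq(X_train[:, :d], y_train, rcond=None)
        preds[:, k] = X_new[:, :d] @ coef
    return preds @ weights
```

A typical call computes `Z = cv_predictions(X, y, dims)`, then `w = stack_weights(Z, y)`, and finally `stacked_predict(X, y, X_new, dims, w)`. The cross-validation step matters here: with in-sample predictions from nested projections, unregularized least squares would favor the largest model, which is the situation the paper's regularized objective is designed to handle.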

Paper Structure (32 sections, 12 theorems, 146 equations, 6 figures)

Key Result

Theorem 4.1

Suppose $0 < \tau < 2$ and $d_k \geq d_{k-1} + 4/(2-\tau)$ for all $k$. The population risk of the stacked model with weights from loss:lasso is strictly less than the population risk of the data-selected best single model eq:best; furthermore, if $d_k \geq d_{k-1} + 5/(2-\tau)$ for all $k$, there e
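
The dimension-gap hypothesis in the theorem is simple arithmetic, and a small helper can check it for a given list of subspace dimensions. This is a sketch under stated assumptions: the function name `satisfies_gap_condition` is hypothetical, the check is applied only to consecutive entries of the list supplied (any convention for a starting dimension must be added by the caller), and the `constant` argument covers both the $4/(2-\tau)$ and $5/(2-\tau)$ thresholds that appear in the statement.

```python
def satisfies_gap_condition(dims, tau, constant=4.0):
    """Check d_k >= d_{k-1} + constant / (2 - tau) for consecutive dims.

    dims: increasing model dimensions d_1, d_2, ... (illustrative input).
    tau: parameter from the theorem, required to satisfy 0 < tau < 2.
    constant: 4.0 for the first threshold in the statement, 5.0 for the second.
    """
    if not 0 < tau < 2:
        raise ValueError("tau must satisfy 0 < tau < 2")
    required_gap = constant / (2.0 - tau)
    return all(
        later - earlier >= required_gap
        for earlier, later in zip(dims, dims[1:])
    )


# With tau = 0.5 the required gap is 4 / 1.5, roughly 2.67,
# so integer dimensions must be at least 3 apart.
print(satisfies_gap_condition([3, 6, 9], tau=0.5))                # True
print(satisfies_gap_condition([3, 5, 9], tau=0.5))                # False
print(satisfies_gap_condition([3, 6, 9], tau=0.5, constant=5.0))  # False
```

Because $4/(2-\tau) > 2$ for every admissible $\tau$, integer dimensions always need a gap of at least three, and larger values of $\tau$ demand wider gaps.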

Figures (6)

  • Figure 1: MSE comparison across different methods, where the function $f$ is linear and depends on the first 35 covariates.
  • Figure 2: MSE comparison across different methods, where the function $f$ is linear and depends on the first 60 covariates.
  • Figure 3: MSE comparison across different methods, where the function $f$ is nonlinear and depends on the first 35 covariates.
  • Figure 4: MSE comparison across different methods, where the function $f$ is nonlinear and depends on the first 60 covariates.
  • Figure B.5: MSE comparison across different methods, where the function $f$ is linear and depends on the first 60 covariates, with noise drawn from a Laplace distribution.
  • ...and 1 more figure

Theorems & Definitions (29)

  • Remark 1
  • Theorem 4.1
  • Remark 2
  • Lemma 5.1
  • Theorem 5.2
  • Theorem 7.1
  • Theorem 8.1
  • Remark 3
  • Proof of Lemma \ref{lem:equi}
  • Lemma A.1
  • ...and 19 more