
Model Selection in High-Dimensional Linear Regression using Boosting with Multiple Testing

George Kapetanios, Vasilis Sarafidis, Alexia Ventouri

TL;DR

This paper proposes a new approach, Boosting with Multiple Testing (BMT), which combines forward stepwise variable selection with the multiple testing framework of Chudik et al. (2018), and shows that BMT yields sparse, interpretable specifications with favourable out-of-sample performance.

Abstract

High-dimensional regression specification and analysis is a complex and active area of research in statistics, machine learning, and econometrics. This paper proposes a new approach, Boosting with Multiple Testing (BMT), which combines forward stepwise variable selection with the multiple testing framework of Chudik et al. (2018). At each stage, the model is updated by adding only the most significant regressor conditional on those already included, while a family-wise multiple testing filter is applied to the remaining candidates. In this way, the method retains the strong screening properties of Chudik et al. (2018) while operating in a less greedy manner with respect to proxy and noise variables. Using sharp probability inequalities for heterogeneous strongly mixing processes from Dendramis et al. (2022), we show that BMT enjoys oracle-type properties relative to an approximating model that includes all true signals and excludes pure noise variables: this model is selected with probability tending to one, and the resulting estimator achieves standard parametric rates for prediction error and coefficient estimation. Additional results establish conditions under which BMT recovers the exact true model and avoids selection of proxy signals. Monte Carlo experiments indicate that BMT performs very well relative to OCMT (the One Covariate at a time Multiple Testing procedure of Chudik et al. (2018)) and Lasso-type procedures, delivering higher model selection accuracy and smaller RMSE for the estimated coefficients, especially under strong multicollinearity of the regressors. Two empirical illustrations, based on a large set of macro-financial indicators as covariates, show that BMT yields sparse, interpretable specifications with favourable out-of-sample performance.
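The stagewise idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`critical_value`, `conditional_t_stats`, `forward_select`), the parameters `p` and `delta`, and the threshold form $\Phi^{-1}(1 - p/(2n^{\delta}))$ are assumptions made for illustration, as are the plain homoskedastic OLS t-statistics. The abstract does not say how the multiple testing filter treats candidates that fail the test at an intermediate stage, so this sketch simply retests every remaining candidate at each stage and stops when none passes.

```python
import numpy as np
from statistics import NormalDist


def critical_value(n, p=0.05, delta=1.0):
    """Bonferroni-style threshold Phi^{-1}(1 - p / (2 n^delta)) (assumed form)."""
    return NormalDist().inv_cdf(1.0 - p / (2.0 * n ** delta))


def conditional_t_stats(y, X, selected, candidates):
    """OLS t-statistic of each candidate, given an intercept and the selected columns."""
    T = len(y)
    stats = {}
    for j in candidates:
        cols = selected + [j]  # candidate j is always the last regressor
        Z = np.column_stack([np.ones(T), X[:, cols]])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        sigma2 = resid @ resid / (T - Z.shape[1])
        cov = sigma2 * np.linalg.inv(Z.T @ Z)
        stats[j] = beta[-1] / np.sqrt(cov[-1, -1])
    return stats


def forward_select(y, X, p=0.05, delta=1.0):
    """Add one regressor per stage: the most significant one, if it passes the threshold."""
    n = X.shape[1]
    cv = critical_value(n, p, delta)
    selected = []
    candidates = list(range(n))
    while candidates:
        stats = conditional_t_stats(y, X, selected, candidates)
        best = max(stats, key=lambda j: abs(stats[j]))
        if abs(stats[best]) <= cv:
            break  # no remaining candidate is significant: stop
        selected.append(best)
        candidates.remove(best)
    return selected
```

Calling `forward_select(y, X)` with a response vector `y` of length `T` and a `T`-by-`n` regressor matrix `X` returns the indices of the selected columns in the order they were added. Adding only one regressor per stage is what distinguishes this from a procedure that admits every candidate passing the test at once.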

Paper Structure

This paper contains 18 sections, 11 theorems, 135 equations, 9 figures, and 45 tables.

Key Result

Theorem 1

Consider the DGP (dgp1) with $k$ signals, $k^{\ast}$ pseudo-signals, and $n-k-k^{\ast}$ noise variables, and suppose that Assumptions ass0-ass2 and ass10 hold, Assumption ass_proj holds for $x_{it}$ and $\boldsymbol{q}_{\cdot t}=\boldsymbol{x}_{(j-1),t}$, $i\in\mathfrak{A}_{\left( j\right) }$, $j=\ldots$

Figures (9)

  • Figure 1: MCC Performance Evaluation over different (T,n) values, $k=4$, $\alpha=0.8$
  • Figure 2: Relative RMSE Performance Evaluation over different (T,n) values, $k=4$, $\alpha=0.8$
  • Figure 3: Visual Summary of Performance Evaluation for VIF=4, $\pi=0.25$, $\alpha=0.8$
  • Figure 4: Visual Summary of Performance Evaluation for VIF=4, $\pi=0.75$, $\alpha=0.8$
  • Figure 5: Visual Summary of Performance Evaluation for VIF=2, $\pi=0.25$, $\alpha=0.8$
  • ...and 4 more figures

Theorems & Definitions (17)

  • Theorem 1
  • Remark 1: On the choice of $\delta$ in Theorem 1
  • Theorem 2
  • Remark 2: Early stopping and comparison with OCMT
  • Theorem 3
  • Theorem 4
  • Remark 3: Robustness to heteroskedasticity
  • Remark 4: Interpretation of signal-to-proxy dominance condition (28) under canonical dependence structures
  • Theorem 5
  • Theorem 6
  • ...and 7 more