A generalized Bayesian approach for high-dimensional robust regression with serially correlated errors and predictors

Saptarshi Chakraborty, Kshitij Khare, George Michailidis

TL;DR

This work tackles high-dimensional robust regression under serially correlated errors by developing a generalized Bayesian framework built on a scaled pseudo-Huber (SPH) loss that adaptively balances $\ell_2$ and $\ell_1$ behavior. The authors formulate an SPH-based likelihood with latent scales and design priors for the regression coefficients (ridge or spike-and-slab) and the robustness parameter $\alpha$, enabling uncertainty quantification without ad hoc tuning. They prove posterior-consistency results in both low- and high-dimensional regimes and establish strong selection consistency for the sparsity pattern under mild dependence assumptions, complemented by extensive simulations and a GDP forecast application. Empirically, SPH matches or surpasses traditional $\ell_1$/$\ell_2$ methods across heavy-, moderate-, and thin-tailed data, while offering calibrated uncertainty and robust variable selection in the presence of serial correlation and contamination. The practical impact lies in providing a scalable, robust Bayesian tool for high-dimensional regression in time-series and econometric contexts where outliers and dependence are pervasive.

Abstract

This paper introduces a loss-based generalized Bayesian methodology for high-dimensional robust regression with serially correlated errors and predictors. The proposed framework employs a novel scaled pseudo-Huber (SPH) loss function, which smooths the well-known Huber loss, effectively balancing quadratic ($\ell_2$) and absolute linear ($\ell_1$) loss behaviors. This flexibility enables the framework to accommodate both thin-tailed and heavy-tailed data efficiently. The generalized Bayesian approach constructs a working likelihood based on the SPH loss, facilitating efficient and stable estimation while providing rigorous uncertainty quantification for all model parameters. Notably, this approach allows formal statistical inference without requiring ad hoc tuning parameter selection while adaptively addressing a wide range of tail behavior in the errors. By specifying appropriate prior distributions for the regression coefficients--such as ridge priors for small or moderate-dimensional settings and spike-and-slab priors for high-dimensional settings--the framework ensures principled inference. We establish rigorous theoretical guarantees for accurate parameter estimation and correct predictor selection under sparsity assumptions for a wide range of data generating setups. Extensive simulation studies demonstrate the superior performance of our approach compared to traditional Bayesian regression methods based on $\ell_2$ and $\ell_1$-loss functions. The results highlight its flexibility and robustness, particularly in challenging high-dimensional settings characterized by data contamination.
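
To make the $\ell_2$/$\ell_1$ balancing concrete, here is a minimal Python sketch. The functional form used below, $\sqrt{1+\alpha^2}\,(\sqrt{\alpha^2 + r^2} - \alpha)$, is an assumption: it is the form implied by the normal-GIG hierarchy in Proposition 1 under the usual GIG parameterization, and the paper's exact definition and scaling may differ. The function name `sph_loss` is hypothetical.

```python
import math


def sph_loss(residual: float, alpha: float) -> float:
    """Assumed scaled pseudo-Huber loss for a residual r and robustness alpha > 0.

    Equals sqrt(1 + alpha^2) * (sqrt(alpha^2 + r^2) - alpha), rewritten as
    sqrt(1 + alpha^2) * r^2 / (sqrt(alpha^2 + r^2) + alpha) to avoid
    cancellation when alpha is much larger than |r|.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    scale = math.sqrt(1.0 + alpha * alpha)
    return scale * residual * residual / (math.hypot(alpha, residual) + alpha)


def l2_loss(residual: float) -> float:
    """Quadratic loss r^2 / 2."""
    return 0.5 * residual * residual


def l1_loss(residual: float) -> float:
    """Absolute loss |r|."""
    return abs(residual)


if __name__ == "__main__":
    # Compare the assumed SPH loss with the two limiting losses.
    for r in (0.5, 2.0, 10.0):
        print(
            f"r={r:5.1f}  "
            f"small alpha: {sph_loss(r, 1e-6):.6f} (l1: {l1_loss(r):.6f})  "
            f"large alpha: {sph_loss(r, 1e4):.6f} (l2: {l2_loss(r):.6f})"
        )
```

Under this assumed form, a small $\alpha$ makes the loss behave like $|r|$ and a large $\alpha$ makes it behave like $r^2/2$, which matches the adaptive behavior described in the abstract.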

Paper Structure

This paper contains 28 sections, 4 theorems, 170 equations, 6 figures, 2 tables, 2 algorithms.

Key Result

Proposition 1

Consider the hierarchical distribution for a real random variable: ${\boldsymbol{\varepsilon}} \mid \lambda \sim \mathcal{N}(0, \lambda)$, with $\lambda \mid \alpha \sim \mathrm{GIG}(a = 1+\alpha^2, b = \alpha^2, p = 1)$ for any fixed $\alpha > 0$. Then, which is the generalized density associated with the scaled pseudo-Huber loss function with tuning
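
The marginalization behind Proposition 1 can be sketched as follows. This is a worked derivation under an assumption, not the paper's own display: it takes the common GIG parameterization with density proportional to $\lambda^{p-1}\exp\{-(a\lambda + b/\lambda)/2\}$, and the paper's convention and normalizing constants may differ. It uses the standard identities $\int_0^\infty x^{\nu-1} e^{-(ax + c/x)/2}\,dx = 2\,(c/a)^{\nu/2} K_\nu(\sqrt{ac})$ and $K_{1/2}(z) = \sqrt{\pi/(2z)}\,e^{-z}$.

```latex
\begin{aligned}
f(\varepsilon \mid \alpha)
  &= \int_0^\infty \mathcal{N}(\varepsilon;\, 0, \lambda)\,
     \mathrm{GIG}(\lambda;\, a, b, p = 1)\, d\lambda \\
  &\propto \int_0^\infty \lambda^{-1/2}
     \exp\left\{ -\frac{1}{2}\left( a\lambda + \frac{b + \varepsilon^2}{\lambda} \right) \right\} d\lambda \\
  &= 2 \left( \frac{b + \varepsilon^2}{a} \right)^{1/4}
     K_{1/2}\!\left( \sqrt{a\,(b + \varepsilon^2)} \right) \\
  &= \sqrt{\frac{2\pi}{a}}\,
     \exp\left\{ -\sqrt{a\,(b + \varepsilon^2)} \right\} \\
  &\propto \exp\left\{ -\sqrt{(1 + \alpha^2)(\alpha^2 + \varepsilon^2)} \right\},
  \qquad a = 1 + \alpha^2,\; b = \alpha^2 .
\end{aligned}
```

Under this parameterization the exponent behaves like $|\varepsilon|$ as $\alpha \to 0$ and, after subtracting the constant $\alpha\sqrt{1+\alpha^2}$, like $\varepsilon^2/2$ as $\alpha \to \infty$.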

Figures (6)

  • Figure 1: Median posterior MSEs (over replicates and ${\boldsymbol{\beta}}$ coordinates) for Bayesian $\ell_1$, $\ell_2$, and SPH regression across simulation settings (detailed in the Supplementary Tables). Panels A and C present low/moderate-dimensional settings with the ridge prior, while panels B and D depict sparse high-dimensional settings with the spike-and-slab prior. Each panel is grouped by error distribution (heavy, moderate, and thin tails), displayed as subplots/facets. Median posterior MSE values are scaled relative to SPH in each setting, with results for SPH, $\ell_1$, and $\ell_2$ regressions shown in red, green, and purple, respectively.
  • Figure 2: Median (over replicates and ${\boldsymbol{y}}$ coordinates) prediction MSEs for Bayesian $\ell_1$, $\ell_2$, and SPH regression across simulation settings (detailed in the Supplementary Tables). Panels A and C show low/moderate-dimensional setups with the ridge prior, while panels B and D depict sparse high-dimensional setups with the spike-and-slab prior. Each panel is grouped by error distribution (heavy, moderate, and thin tails), displayed as subplots/facets. Median prediction MSE values are scaled relative to SPH in each setting, with results for SPH, $\ell_1$, and $\ell_2$ regressions shown in red, green, and purple, respectively.
  • Figure 3: Replication-based coverages (Panel A) and mean lengths (Panel B; vertical axis on a log scale) of 90% equi-tailed Bayesian credible intervals for Bayesian $\ell_1$, $\ell_1$-adj, $\ell_2$, SPH, and SPH-adj regression models across various error distribution categories (extremely heavy, heavy, moderate, and thin) and sample sizes ($n$).
  • Figure 4: Comparing the prediction (posterior) MSEs for the different models, priors, and fit combinations relative to the SPH-SS outlier-filtered refit model for pre-post-COVID (left panel) and pre-post-recession (right panel) analyses. The boxplots display the prediction MSE ratios for various models (L1-N, L1-SS, L2-N, L2-SS, SPH-N, and SPH-SS) for the original fit (red) and outlier-filtered refit (blue). The L2 model lacks an outlier-filtered refit version as it does not include the $\lambda_i$ parameters. The horizontal black line at 1 represents the baseline performance of the SPH-SS outlier-filtered refit model. Lower MSE ratios indicate better predictive performance compared to the baseline.
  • Figure S.1: Visualizing the joint generalized posterior distributions of the first two coordinates $(\beta_1, \beta_2)$ of ${\boldsymbol{\beta}}$, under different losses and a weakly informative Gaussian prior belief distribution for ${\boldsymbol{\beta}}$, through point clouds and density contours. The contour lines represent the joint highest posterior density sets for $(\beta_1, \beta_2)$ at 50%, 80%, 90%, and 95% probability levels.
  • ...and 1 more figure
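
The coverage and mean-length summaries named in the Figure 3 caption can be illustrated with a short Python sketch. This shows the generic computation for equi-tailed credible intervals from posterior draws. It is not the authors' code, the helper names are hypothetical, and it does not reproduce the "-adj" variants, whose definition is not given here.

```python
def equal_tailed_interval(draws, level=0.90):
    """Equi-tailed credible interval: remove (1 - level) / 2 mass from each tail."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0

    def quantile(q):
        # Linear interpolation between neighbouring order statistics.
        pos = q * (len(ordered) - 1)
        lower = int(pos)
        upper = min(lower + 1, len(ordered) - 1)
        return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])

    return quantile(tail), quantile(1.0 - tail)


def coverage_and_mean_length(intervals, truth):
    """Share of replicate intervals containing the true value, and their mean length."""
    hits = [lo <= truth <= hi for lo, hi in intervals]
    lengths = [hi - lo for lo, hi in intervals]
    return sum(hits) / len(hits), sum(lengths) / len(lengths)


if __name__ == "__main__":
    # Toy stand-in for posterior draws from three replicates (not real MCMC output).
    replicate_draws = [
        [0.8, 0.9, 1.0, 1.1, 1.2, 1.3],
        [0.4, 0.6, 0.9, 1.0, 1.4, 1.5],
        [1.6, 1.7, 1.8, 1.9, 2.0, 2.1],
    ]
    intervals = [equal_tailed_interval(d, level=0.90) for d in replicate_draws]
    coverage, mean_length = coverage_and_mean_length(intervals, truth=1.0)
    print(f"coverage={coverage:.2f}, mean length={mean_length:.3f}")
```

In a simulation study, one interval would be computed per replicate and per coefficient, and the coverage would then be compared against the nominal 90% level.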

Theorems & Definitions (11)

  • Proposition 1
  • Remark
  • Remark
  • Theorem 1: Posterior mode consistency with a ridge prior distribution
  • Theorem 2: Posterior distribution consistency with a ridge prior distribution
  • Remark
  • Remark
  • Theorem 3: Strong selection consistency with spike-and-slab prior
  • Remark
  • Remark
  • ...and 1 more