High-dimensional robust regression under heavy-tailed data: Asymptotics and Universality

Urte Adomaityte, Leonardo Defilippis, Bruno Loureiro, Gabriele Sicuro

TL;DR

The paper develops a high-dimensional theory for robust regression under heavy-tailed contamination by modelling covariates as elliptical scale mixtures and analysing regularised M-estimators via replica methods in the proportional regime. It provides sharp asymptotic characterisations for both the M-estimator and the Bayes-optimal estimator, showing that optimal Huber tuning alone is insufficient in high dimensions and that additional regularisation is necessary, while ridge regression attains universal or tail-dependent decay rates depending on covariate moments. A universal framework is established for generalised linear estimation with convex penalties trained on an elliptical mixture model, with explicit fixed-point equations for the order parameters and proximal operators, and an extension to multi-cluster mixtures. The results show how heavy-tailed data alter convergence rates and optimal regularisation, offering practical guidance for robust high-dimensional regression in settings where covariates and labels exhibit heavy tails, including real-data validation on stock-market data.

Abstract

We investigate the high-dimensional properties of robust regression estimators in the presence of heavy-tailed contamination of both the covariates and response functions. In particular, we provide a sharp asymptotic characterisation of M-estimators trained on a family of elliptical covariate and noise data distributions including cases where second and higher moments do not exist. We show that, despite being consistent, the Huber loss with optimally tuned location parameter $\delta$ is suboptimal in the high-dimensional regime in the presence of heavy-tailed noise, highlighting the necessity of further regularisation to achieve optimal performance. This result also uncovers the existence of a transition in $\delta$ as a function of the sample complexity and contamination. Moreover, we derive the decay rates for the excess risk of ridge regression. We show that, while it is both optimal and universal for covariate distributions with finite second moment, its decay rate can be considerably faster when the covariates' second moment does not exist. Finally, we show that our formulas readily generalise to a richer family of models and data distributions, such as generalised linear estimation with arbitrary convex regularisation trained on mixture models.
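
For reference, the textbook Huber loss with parameter $\delta$ and an $\ell_2$-regularised M-estimator of the kind discussed in the abstract can be written as below. This is a sketch in standard notation: the symbols $r$, $\hat{\boldsymbol{w}}$, $y_i$, $\boldsymbol{x}_i$ and the factor $\lambda/2$ in the penalty are assumptions made for illustration and may differ from the paper's own normalisation.

$$
\ell_\delta(r)=
\begin{cases}
\tfrac{1}{2}r^2, & |r|\le\delta,\\
\delta|r|-\tfrac{1}{2}\delta^2, & |r|>\delta,
\end{cases}
\qquad
\hat{\boldsymbol{w}}=\arg\min_{\boldsymbol{w}\in\mathbb{R}^d}\ \sum_{i=1}^{n}\ell_\delta\big(y_i-\boldsymbol{w}^\top\boldsymbol{x}_i\big)+\frac{\lambda}{2}\|\boldsymbol{w}\|_2^2 .
$$

In this form the square loss is recovered as $\delta\to\infty$, while $\ell_\delta/\delta$ tends to the absolute loss $|r|$ as $\delta\to 0$, so $\delta$ interpolates between ridge regression and a least-absolute-deviation fit.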

Paper Structure (35 sections, 99 equations, 8 figures, 1 table)

Figures (8)

  • Figure 1: Value of $\varepsilon_{\rm est}$ obtained using a regularised Huber at given $\lambda=10^{-3}$ as a function of $\delta$ for different values of $\alpha$. Here the contamination level is $\epsilon_{\rm n}=0.5$.
  • Figure 2: Label contamination of standard Gaussian covariates, with distribution $p(\boldsymbol{x})=\mathcal{N}(\boldsymbol{x};0,\frac{1}{d}\boldsymbol{I}_d)$, as a function of the sample complexity $\alpha=n/d$ for different contamination levels $\epsilon_{\rm n}\in[0,1]$ as in Eq. \ref{eq:def:hubercontnoise}. The label-contaminating distribution is inverse-gamma $\hat{\varrho}_0(\sigma)\propto\sigma^{-2a-1}\exp(-b\sigma^{-2})$, for details see Table \ref{tab:examples}. Theoretical predictions (lines) are compared with the results of numerical experiments (dots) obtained averaging over $20$ instances with $d=10^3$. (Left) Purely Gaussian noise ($\epsilon_{\rm n}=0$). (Top) Case $\epsilon_{\rm n}>0$, $b=1$ and $a=4/5<1$, implying $\hat{\varrho}_0(\sigma)$ has infinite mean and thus $\mathbb{E}[\eta^2]=+\infty$ but $\mathbb{E}[\eta]<+\infty$. The performance degrades as the contamination level $\epsilon_{\rm n}$ is increased. Optimally regularised Huber (red) achieves the optimal Bayesian performance (solid), while with fixed small regularisation the Huber estimator performs suboptimally (purple). Square loss results are not represented as, in this case, the average estimation error is not finite. (Bottom) Case $\epsilon_{\rm n}>0$, $a=1+b=1+1/10$, corresponding to $\mathbb{E}[\eta^2]=1$. The performance uniformly improves as the contamination $\epsilon_{\rm n}$ grows. Optimally regularised Huber (red) achieves optimal Bayesian performance (solid), while both Huber with untuned regularisation (purple) and optimally regularised ridge (blue) are suboptimal.
  • Figure 3: Estimation error $\varepsilon_{\rm est}$ as in Eq. \ref{eq:def:esterr} at large-$\alpha$ using regularised optimal Huber loss (solid lines). The covariates are Gaussian, whereas the label noise is obtained as in Eq. \ref{eq:def:hubercontnoise} with $\varrho(\sigma)\sim\sigma^{-2a-1}$, $a>0$. The results are compared with the Bayes-optimal performance (squares). The dotted line shows a scaling of $\alpha^{-1}$.
  • Figure 4: (Left) Estimation error as a function of sample complexity $\alpha=n/d$. Dots correspond to the average test error of $50$ numerical experiments in dimension $d=10^3$. Covariates are contaminated as in Eq. \ref{eq:def:hubercont} using for the contaminating distribution $\varrho_0$ an inverse-gamma distribution with $b=a-1=1/10$, see Table \ref{tab:examples} for details. The labels are distributed according to an inverse-gamma distribution ($\epsilon_{\rm n}=1$) with parameters $a=1+b=2$, corresponding to $\mathbb{E}[\eta^2]=1$. (Right) Estimation error $\varepsilon_{\rm est}$ as in Eq. \ref{eq:def:esterr} at large-$\alpha$ obtained from our theory for the square loss (dashed line), regularised optimal Huber loss (solid line) and Bayes-optimal performance given by Result \ref{res:BO} (squares). The covariates' variance is Pareto-distributed to have $\varrho(\sigma)\sim\sigma^{-2a-1}$ with $a>0$, and the label noise is Gaussian. White dots correspond to numerical experiments in dimension $d=50$, averaged over $50$ instances. The black dotted line shows a scaling of $\alpha^{-1}$.
  • Figure 5: Estimation error $\varepsilon_{\rm est}$ as in Eq. \ref{eq:def:esterr} as a function of $\alpha$ using $\ell_2$-regularised square loss with $\lambda=1/10$. Dots correspond to the average error of $200$ numerical experiments (varying $n$) on $\boldsymbol{z}_i$ with labels generated as in Eq. \ref{eq:def:data} with Gaussian label noise. Theoretical predictions obtained by taking into account the mean of the covariance spectrum and the covariates' power-law decay (blue) show better agreement with the experiment than a standard Gaussian equivalence assumption (orange).
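
The numerical experiments described in the captions above (Gaussian covariates with covariance $\frac{1}{d}\boldsymbol{I}_d$, labels contaminated by inverse-gamma scale noise, an $\ell_2$-regularised Huber fit) can be mimicked at toy scale with the following minimal sketch. It is not the authors' code. Several ingredients are assumptions made for illustration: the linear teacher `w_star` with standard Gaussian entries, the unit-variance Gaussian component of the noise mixture, the error normalisation $\|\hat{\boldsymbol{w}}-\boldsymbol{w}_\star\|^2/d$, the plain gradient-descent solver, and all function names. The density $\hat{\varrho}_0(\sigma)\propto\sigma^{-2a-1}\exp(-b\sigma^{-2})$ is sampled by drawing $\sigma^2$ from an inverse-gamma law with shape $a$ and scale $b$, which gives exactly that density for $\sigma$.

```python
import math
import random


def huber_loss(r, delta):
    """Huber loss: quadratic for |r| <= delta, linear beyond."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)


def huber_grad(r, delta):
    """Derivative of the Huber loss in r, i.e. the residual clipped to [-delta, delta]."""
    return max(-delta, min(delta, r))


def sample_noise(rng, eps_n, a, b):
    """Contaminated label noise (assumed mixture form).

    With probability 1 - eps_n: standard Gaussian noise.
    With probability eps_n: Gaussian noise whose variance sigma^2 is
    inverse-gamma with shape a and scale b.
    """
    if rng.random() >= eps_n:
        return rng.gauss(0.0, 1.0)
    sigma2 = b / rng.gammavariate(a, 1.0)  # b / Gamma(a, 1) is inverse-gamma(a, b)
    return rng.gauss(0.0, math.sqrt(sigma2))


def objective(w, X, y, delta, lam):
    """Sum of Huber losses on the residuals plus (lam / 2) * ||w||^2."""
    fit = sum(
        huber_loss(y_i - sum(w_j * x_j for w_j, x_j in zip(w, x)), delta)
        for x, y_i in zip(X, y)
    )
    return fit + 0.5 * lam * sum(w_j * w_j for w_j in w)


def fit_huber_ridge(X, y, delta, lam, n_iter=500):
    """Run a fixed number of gradient-descent steps on the objective above."""
    d = len(X[0])
    # trace(X^T X) + lam upper-bounds the Lipschitz constant of the gradient,
    # so a step of size 1 / L never increases the objective.
    L = sum(x_j * x_j for x in X for x_j in x) + lam
    w = [0.0] * d
    for _ in range(n_iter):
        grad = [lam * w_j for w_j in w]
        for x, y_i in zip(X, y):
            r = y_i - sum(w_j * x_j for w_j, x_j in zip(w, x))
            g = huber_grad(r, delta)
            for j in range(d):
                grad[j] -= g * x[j]
        w = [w_j - g_j / L for w_j, g_j in zip(w, grad)]
    return w


rng = random.Random(0)
d, n = 10, 100  # toy sizes; sample complexity alpha = n / d = 10
w_star = [rng.gauss(0.0, 1.0) for _ in range(d)]  # hypothetical teacher vector
X = [[rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)] for _ in range(n)]
y = [
    sum(w_j * x_j for w_j, x_j in zip(w_star, x))
    + sample_noise(rng, eps_n=0.5, a=0.8, b=1.0)
    for x in X
]

w_hat = fit_huber_ridge(X, y, delta=1.0, lam=1e-3)
est_err = sum((u - v) ** 2 for u, v in zip(w_hat, w_star)) / d
print(f"estimation error at alpha = {n / d:.0f}: {est_err:.4f}")
```

The values `a=0.8`, `b=1.0` and `eps_n=0.5` follow the infinite-variance case of Figure 2 and the contamination level of Figure 1. Sweeping `delta`, `lam` and `n` and averaging over several seeds gives curves of the same kind as those in the figures, although at $d=10$ finite-size effects are large and no agreement with the asymptotic predictions should be expected.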