Derivatives and residual distribution of regularized M-estimators with application to adaptive tuning

Pierre C Bellec, Yiwei Shen

TL;DR

This work develops a framework for robust, regularized M-estimators in linear models with Gaussian design by deriving differentiability properties with respect to both responses and designs, and by characterizing the residual distribution in high-dimensional regimes. It introduces a data-driven adaptive tuning criterion that proxies out-of-sample error without requiring knowledge of the noise distribution or design covariance, using observable quantities such as the estimated degrees of freedom and a residual-based matrix $\boldsymbol{V}$. A stochastic representation links residuals to a Gaussian term whose magnitude reflects the estimator’s out-of-sample error, and the analysis reveals new connections between derivatives and effective degrees of freedom. The paper specializes to the Huber loss with Elastic-Net penalty, provides practical active-set expressions, and validates the theory via simulations, including heavy-tailed noise and anisotropic designs. It also shows how the strong convexity assumption can be relaxed using Lipschitz extensions, broadening applicability to non-smooth penalties typical in high-dimensional robust regression.

Abstract

This paper studies M-estimators with a gradient-Lipschitz loss function regularized with a convex penalty, in linear models with Gaussian design matrix and arbitrary noise distribution. A practical example is the robust M-estimator constructed with the Huber loss and the Elastic-Net penalty when the noise distribution has heavy tails. Our main contributions are three-fold. (i) We provide general formulae for the derivatives of regularized M-estimators $\hat{\beta}(y,X)$ where differentiation is taken with respect to both $y$ and $X$; this reveals a simple differentiability structure shared by all convex regularized M-estimators. (ii) Using these derivatives, we characterize the distribution of the residual $r_i = y_i-x_i^\top\hat{\beta}$ in the intermediate high-dimensional regime where dimension and sample size are of the same order. (iii) Motivated by the distribution of the residuals, we propose a novel adaptive criterion to select tuning parameters of regularized M-estimators. The criterion approximates the out-of-sample error up to an additive constant independent of the estimator, so that minimizing the criterion provides a proxy for minimizing the out-of-sample error. The proposed adaptive criterion does not require knowledge of the noise distribution or of the covariance of the design. Simulated data confirm the theoretical findings, regarding both the distribution of the residuals and the success of the criterion as a proxy of the out-of-sample error. Finally, our results reveal new relationships between the derivatives of $\hat{\beta}(y,X)$ and the effective degrees of freedom of the M-estimator, which are of independent interest.
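
As a rough illustration of the adaptive criterion described above, the sketch below evaluates the quantity $\|\boldsymbol{r} + (\hat{\mathsf{df}}/\mathrm{tr}[\boldsymbol{V}])\psi(\boldsymbol{r})\|^2/n$ that appears in the Figure 1 caption, specialized to the Huber loss. It assumes that $\hat{\mathsf{df}}$ and $\mathrm{tr}[\boldsymbol{V}]$ have already been computed by the caller, since their formulas are not reproduced in this summary. The names `huber_psi`, `tuning_criterion` and the threshold argument `delta` are hypothetical and chosen only for this example.

```python
import numpy as np

def huber_psi(r, delta):
    """Derivative of the Huber loss: identity on [-delta, delta], clipped outside."""
    return np.clip(r, -delta, delta)

def tuning_criterion(r, df_hat, tr_V, delta):
    """Evaluate ||r + (df_hat / tr_V) * psi(r)||^2 / n for the Huber loss.

    r      : residuals y - X @ beta_hat, shape (n,)
    df_hat : estimated degrees of freedom (assumed supplied by the caller)
    tr_V   : trace of the matrix V (assumed supplied by the caller)
    delta  : Huber threshold (hypothetical parameter name)
    """
    r = np.asarray(r, dtype=float)
    n = r.shape[0]
    adjusted = r + (df_hat / tr_V) * huber_psi(r, delta)
    return float(adjusted @ adjusted) / n

# Toy residuals, only to show the call signature.
r = np.array([0.5, -2.0, 1.0, 3.0])
crit = tuning_criterion(r, df_hat=1.0, tr_V=2.0, delta=1.0)
```

In a tuning loop, one would presumably evaluate this value for each candidate pair $(\lambda, \tau)$ and keep the pair with the smallest value, since the abstract states that the criterion matches the out-of-sample error up to an additive constant that does not depend on the estimator.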

Paper Structure

This paper contains 17 sections, 22 theorems, 108 equations, 9 figures, 1 table.

Key Result

Theorem 1

Let the paper's main assumption (label assumMain) be fulfilled. For almost every $(\boldsymbol{y},\boldsymbol{X})$ the map $(\boldsymbol{y},\boldsymbol{X})\mapsto \widehat{{\boldsymbol{\beta}}}(\boldsymbol{y},\boldsymbol{X})$ is differentiable at $(\boldsymbol{y},\boldsymbol{X})$ and there exists a matrix ${\widehat{\boldsymbol{A}}}$ … where $\boldsymbol{e}_i\in\mathbb R^n$ and $\boldsymbol{e}_j\in\mathbb R^p$ are canonical basis vectors, …

Figures (9)

  • Figure 1: Heatmaps for $\|\boldsymbol{\Sigma}^{1/2} (\hat{{\boldsymbol{\beta}}} - {\boldsymbol{\beta}}^*)\|^{2}$, its approximation $\|\boldsymbol{r}+({\hat{\mathsf{df}}}/{\mathop{\mathrm{tr}}\limits[\boldsymbol{V}]})\psi(\boldsymbol{r})\|^{2}/n-\| {\boldsymbol{\varepsilon} }\|^{2}/n$, and the approximation error $\bigl|\|\boldsymbol{\Sigma}^{1/2} (\hat{{\boldsymbol{\beta}}} - {\boldsymbol{\beta}}^*)\|^{2} - \bigl(\| \boldsymbol{r} + ({\hat{\mathsf{df}}}/{\mathop{\mathrm{tr}}\limits[\boldsymbol{V}]}) \psi (\boldsymbol{r}) \|^{2} / n - \| {\boldsymbol{\varepsilon} } \|^{2} / n\bigr)\bigr|$ for the Huber loss and Elastic-Net penalty on a grid of tuning parameters $(\lambda, \tau)$ where $\lambda \in [0.0032, 0.41]$ and $\tau \in [10^{-10}, 0.1]$. Each cell is the average over 100 repetitions. See the paper's simulations section for more details.
  • Figure 2: Histogram and QQ-plot for $\zeta_{1}$ (see the paper's definition of $\zeta_i$) under Huber Elastic-Net regression for different choices of tuning parameters $(\lambda, \tau)$. Top left: $(0.036, 10^{-10})$; top right: $(0.054,0.01)$; bottom left: $(0.036, 0.01)$; bottom right: $(0.024, 0.1)$. Each panel contains 600 data points generated with an anisotropic design matrix and iid $\varepsilon_i$ from the $t$-distribution with 2 degrees of freedom. A detailed setup is provided in the paper's simulations section.
  • Figure 3: Above: Boxplots for $\hat{\mathsf{df}}, \hat{p}, \hat{n}, \mathop{\mathrm{tr}}\limits[\boldsymbol{V}], \mathop{\mathrm{tr}}\limits[\boldsymbol{\Sigma} {\widehat{\boldsymbol{A}}}]$ and $|\mathop{\mathrm{tr}}\limits [ \boldsymbol{\Sigma} {\widehat{\boldsymbol{A}}} ] - \hat{\mathsf{df}} / \mathop{\mathrm{tr}}\limits [ \boldsymbol{V}]|$ in Huber Elastic-Net regression with $\tau = 10 ^{-10}$ and $\lambda \in [0.0032, 0.41]$. Each box contains 200 data points. Below: heatmaps for $\hat{\mathsf{df}}/n$, $\mathop{\mathrm{tr}}\limits[\boldsymbol{V}]/n$ and $\hat{n}/n =\sum_{i=1}^n\psi'(r_i)/n$ under the simulation setup of the out-of-sample error heatmap figure. The detailed simulation setup is given in the paper's simulations section.
  • Figure 4: Heatmaps for the Huber loss and Elastic-Net penalty with $\Lambda = 0.054 n^{1/2}$ on a grid of tuning parameters $(\lambda, \tau)$, where $\lambda \in [0.0032, 0.41]$ and $\tau \in [10^{-10}, 0.1]$. Each cell is the average over 100 repetitions. See the paper's simulations section for more details.
  • Figure 5: Heatmaps for the Huber loss and Elastic-Net penalty with $\Lambda = 0.024 n^{1/2}$ on a grid of tuning parameters $(\lambda, \tau)$, where $\lambda \in [0.00062, 0.081]$ and $\tau \in [10^{-10}, 0.1]$. Each cell is the average over 50 repetitions. See the paper's simulations section for more details.
  • ...and 4 more figures
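
The Figure 3 caption uses the quantity $\hat{n} = \sum_{i=1}^n \psi'(r_i)$. For the Huber loss, $\psi'$ equals 1 where the residual lies in the quadratic zone and 0 elsewhere, so $\hat{n}$ simply counts the residuals that are not clipped. The sketch below is a minimal illustration under that reading. The names `huber_psi_prime` and `n_hat`, the threshold argument `delta`, and the convention of assigning $\psi' = 1$ at the kink $|r| = \delta$ are assumptions made for this example.

```python
import numpy as np

def huber_psi_prime(r, delta):
    """Second derivative of the Huber loss: 1 inside [-delta, delta], 0 outside.

    The value at |r| == delta is a convention (assumed to be 1 here).
    """
    return (np.abs(r) <= delta).astype(float)

def n_hat(r, delta):
    """Sum of psi'(r_i): the number of residuals in the quadratic zone."""
    r = np.asarray(r, dtype=float)
    return float(np.sum(huber_psi_prime(r, delta)))

# Toy residuals: two lie inside [-1, 1], two are clipped.
r = np.array([0.5, -2.0, 0.9, 3.0])
ratio = n_hat(r, delta=1.0) / len(r)
```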

Theorems & Definitions (25)

  • Theorem 1
  • Remark 2
  • Remark 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Corollary 7
  • Corollary 8
  • Theorem 9
  • Corollary 10
  • ...and 15 more