
Robust variance-regularized risk minimization with concomitant scaling

Matthew J. Holland

TL;DR

This work tackles learning under heavy-tailed losses by optimizing a mean–standard-deviation (mean–SD) objective rather than the traditional mean loss. It introduces a gradient-friendly procedure called Modified Sun–Huber, built by extending robust one-dimensional mean estimation to the mean–SD setting via a joint scale-location criterion over $(h,a,b)$. The paper establishes theoretical connections between the population objective and the mean–SD objective, derives finite-sample concentration results, and proposes a practical algorithm with a simple schedule for $(\alpha,\beta)$ (e.g., $\beta=\beta_0/\sqrt{n}$ and $\alpha(\beta)=\beta$). Empirical results on simulated and real datasets show that Modified Sun–Huber often matches or outperforms CVaR and DRO baselines in mean–SD performance, while remaining simple to integrate into standard gradient-based pipelines. Overall, the approach provides a scalable, robust alternative for risk-sensitive learning in the presence of heavy-tailed losses.
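To make the idea of a joint scale-location criterion concrete, here is a minimal sketch of a classical Huber-style concomitant scale criterion for one-dimensional data, together with the $\beta=\beta_0/\sqrt{n}$ schedule mentioned above. It is not the paper's Modified Sun–Huber objective: the hypothesis $h$ and the weight $\alpha$ are left out, and the pseudo-Huber form $\rho(x)=\sqrt{x^{2}+1}-1$ is an assumption, as are all function names.

```python
import math
from typing import Sequence, Tuple


def rho(x: float) -> float:
    """Pseudo-Huber function sqrt(x^2 + 1) - 1 (assumed form of rho)."""
    return math.sqrt(x * x + 1.0) - 1.0


def beta_schedule(n: int, beta0: float = 1.0) -> float:
    """Schedule beta = beta0 / sqrt(n) from the summary above."""
    return beta0 / math.sqrt(n)


def concomitant_criterion(xs: Sequence[float], a: float, b: float, beta: float) -> float:
    """Illustrative criterion: b * mean(rho((x - a) / b)) + beta * b, for b > 0."""
    if b <= 0.0:
        raise ValueError("scale b must be positive")
    return b * sum(rho((x - a) / b) for x in xs) / len(xs) + beta * b


def criterion_gradient(xs: Sequence[float], a: float, b: float, beta: float) -> Tuple[float, float]:
    """Partial derivatives of the illustrative criterion with respect to (a, b)."""
    n = len(xs)
    # b * rho(u / b) = sqrt(u^2 + b^2) - b, with u = x - a.
    grad_a = -sum((x - a) / math.hypot(x - a, b) for x in xs) / n
    grad_b = sum(b / math.hypot(x - a, b) for x in xs) / n - 1.0 + beta
    return grad_a, grad_b
```

Because both partial derivatives are available in closed form, a criterion of this kind can be handed directly to a standard gradient-based solver, with the location $a$ and scale $b$ updated together.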

Abstract

Under losses which are potentially heavy-tailed, we consider the task of minimizing sums of the loss mean and standard deviation, without trying to accurately estimate the variance. By modifying a technique for variance-free robust mean estimation to fit our problem setting, we derive a simple learning procedure which can be easily combined with standard gradient-based solvers to be used in traditional machine learning workflows. Empirically, we verify that our proposed approach, despite its simplicity, performs as well as or better than even the best-performing candidates derived from alternative criteria such as CVaR or DRO risks on a variety of datasets.
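For reference, the quantity being targeted can be written as a naive plug-in computation on a sample of losses. The sketch below is only an illustration of the mean-plus-SD objective itself, not the proposed procedure, which avoids relying on an accurate variance estimate. The function name and the `weight` parameter are hypothetical.

```python
import math
from typing import Sequence


def mean_sd_objective(losses: Sequence[float], weight: float = 1.0) -> float:
    """Naive plug-in estimate of mean(loss) + weight * SD(loss).

    Uses the population-style (divide by n) standard deviation. Under heavy
    tails, the SD term here can be dominated by a few extreme losses.
    """
    n = len(losses)
    if n == 0:
        raise ValueError("need at least one loss value")
    mean = sum(losses) / n
    variance = sum((x - mean) ** 2 for x in losses) / n
    return mean + weight * math.sqrt(variance)
```

A single extreme loss moves both the mean and the SD term of this plug-in estimate, which is the sensitivity that motivates a more robust alternative.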

Paper Structure

This paper contains 24 sections, 8 theorems, 108 equations, 7 figures, and 1 algorithm.

Key Result

Proposition 1

Let $\mathcal{H}$ be such that $\mathbf{E}_{\mu}\lvert \mathsf{L}(h) \rvert^{2} < \infty$ for each $h \in \mathcal{H}$. If we set $\alpha = \alpha(\beta)$ such that $\alpha(\beta)/\sqrt{\beta} \to \widetilde{\alpha} \in [0,\infty)$ as $\beta \to$ [...] for any choice of threshold $a \in \mathbb{R}$ and weight $\alpha \geq 0$.

Figures (7)

  • Figure 1: From left to right, we plot the graphs of $\rho(\cdot)$, $\rho^{\prime}(\cdot)$, and $\rho^{\prime\prime}(\cdot)$ with $\rho$ as in (\ref{eqn:sun_huber_fn}). In the middle plot, the dotted curves represent the upper (blue) and lower (dark pink) bounds in (\ref{eqn:motivation_catoni_condition}) with $\gamma=1$.
  • Figure 2: Graphs of the smooth Huber function, with $\rho$ as in (\ref{eqn:sun_huber_fn}), over a range of smoothing parameters. For visual comparison, the graph of $x \mapsto x^{2}/2$ is plotted with a thick dashed green curve.
  • Figure 3: Graph of the Legendre transform $\rho^{\ast}$ as given in (\ref{eqn:rho_legendre}) over $(-1,1)$.
  • Figure 4: 2D classification example from §\ref{sec:empirical_sims}. The red line represents the initial value used by each method.
  • Figure 5: From each method class, we show the classification error rate and Euclidean norm trajectories corresponding to the setting that achieved the best error rate after the final iteration.
  • ...and 2 more figures
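
The functions plotted in Figures 1 to 3 can be sketched in a few lines. The referenced equations are not reproduced on this page, so the code assumes the pseudo-Huber form $\rho(x)=\sqrt{x^{2}+1}-1$, which is consistent with a Legendre transform defined over $(-1,1)$ and with the comparison against $x^{2}/2$, but it remains an assumption. The function names are hypothetical.

```python
import math


def rho(x: float) -> float:
    """Assumed smooth Huber function: sqrt(x^2 + 1) - 1."""
    return math.sqrt(x * x + 1.0) - 1.0


def rho_prime(x: float) -> float:
    """First derivative x / sqrt(x^2 + 1), which takes values in (-1, 1)."""
    return x / math.sqrt(x * x + 1.0)


def rho_second(x: float) -> float:
    """Second derivative (x^2 + 1)^(-3/2), positive everywhere."""
    return (x * x + 1.0) ** -1.5


def rho_star(u: float) -> float:
    """Legendre transform sup_x (u*x - rho(x)) = 1 - sqrt(1 - u^2) on (-1, 1)."""
    if abs(u) >= 1.0:
        raise ValueError("rho_star is evaluated here only for u in (-1, 1)")
    return 1.0 - math.sqrt(1.0 - u * u)
```

Under this assumed form, $\rho$ behaves like $x^{2}/2$ near zero and grows linearly for large $\lvert x \rvert$, so its derivative stays bounded.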

Theorems & Definitions (15)

  • Proposition 1
  • Proposition 2
  • Proposition 3: Concentration at a shifted location
  • Proposition 4: Joint objective is non-convex and non-smooth
  • Proposition 5
  • Lemma 6: Useful inequalities
  • Lemma 7
  • Lemma 8: Properties of partial objective
  • Proof of Lemma \ref{lem:partial_objective}
  • Proof of Proposition \ref{prop:crit_optimized_scale}
  • ...and 5 more