Robust variance-regularized risk minimization with concomitant scaling

Matthew J. Holland


TL;DR

This work tackles learning under heavy-tailed losses by optimizing a mean–standard-deviation (mean–SD) objective rather than the traditional mean loss. It introduces a gradient-friendly procedure called Modified Sun–Huber, built by extending robust one-dimensional mean estimation to the mean–SD setting via a joint scale-location criterion over $(h,a,b)$. The paper establishes theoretical connections between the population objective and the mean–SD objective, derives finite-sample concentration results, and proposes a practical algorithm with a simple schedule for $(\alpha,\beta)$ (e.g., $\beta=\beta_0/\sqrt{n}$ and $\alpha(\beta)=\beta$). Empirical results on simulated and real datasets show that Modified Sun–Huber often matches or outperforms CVaR and DRO baselines in mean–SD performance, while remaining simple to integrate into standard gradient-based pipelines. Overall, the approach provides a scalable, robust alternative for risk-sensitive learning in the presence of heavy-tailed losses.

Abstract

Under losses which are potentially heavy-tailed, we consider the task of minimizing sums of the loss mean and standard deviation, without trying to accurately estimate the variance. By modifying a technique for variance-free robust mean estimation to fit our problem setting, we derive a simple learning procedure which can be easily combined with standard gradient-based solvers to be used in traditional machine learning workflows. Empirically, we verify that our proposed approach, despite its simplicity, performs as well as or better than even the best-performing candidates derived from alternative criteria such as CVaR or DRO risks on a variety of datasets.
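To make the quantities named above concrete, the following is a minimal Python sketch, not the paper's Modified Sun–Huber procedure. It shows (i) an empirical mean–SD objective over a batch of losses, (ii) the schedule $\beta=\beta_0/\sqrt{n}$, $\alpha(\beta)=\beta$ quoted in the TL;DR, and (iii) a generic M-estimator of location with a fixed scale $b$. The function names, the SD weight, and the specific form $\rho(x)=\sqrt{x^{2}+1}-1$ are assumptions introduced for illustration; this page does not state the paper's definition of $\rho$ or the exact joint criterion over $(h,a,b)$.

```python
import numpy as np


def mean_sd_objective(losses, sd_weight=1.0):
    """Empirical mean plus weighted standard deviation of per-example losses.

    `sd_weight` is a hypothetical knob; np.std uses ddof=0 (population SD).
    """
    losses = np.asarray(losses, dtype=float)
    return float(losses.mean() + sd_weight * losses.std())


def schedule(n, beta0=1.0):
    """Schedule quoted in the TL;DR: beta = beta0 / sqrt(n), alpha(beta) = beta."""
    beta = beta0 / np.sqrt(n)
    alpha = beta
    return float(alpha), float(beta)


def rho(x):
    """ASSUMED smooth Huber-like function: sqrt(x^2 + 1) - 1."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(x * x + 1.0) - 1.0


def rho_prime(x):
    """Derivative of the assumed rho; bounded in (-1, 1)."""
    x = np.asarray(x, dtype=float)
    return x / np.sqrt(x * x + 1.0)


def m_estimate_location(x, scale=1.0, n_iter=200):
    """Generic M-estimator of location for a FIXED scale (not the paper's method).

    Solves mean(rho_prime((x - a) / scale)) = 0 by fixed-point iteration,
    starting from the sample median.
    """
    x = np.asarray(x, dtype=float)
    a = float(np.median(x))
    for _ in range(n_iter):
        a += scale * float(np.mean(rho_prime((x - a) / scale)))
    return a


if __name__ == "__main__":
    sample = [0.0, 0.1, -0.1, 0.2, -0.2, 100.0]  # one gross outlier
    print("sample mean:      ", np.mean(sample))
    print("M-estimate (b=1): ", m_estimate_location(sample))
    print("mean-SD objective:", mean_sd_objective(sample))
    print("(alpha, beta):    ", schedule(n=len(sample)))
```

Because the derivative of the assumed $\rho$ is bounded, a single extreme loss can shift the location estimate only by a limited amount, whereas the sample mean is pulled arbitrarily far. This is the general motivation for robust mean estimation under heavy tails.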


Paper Structure (24 sections, 8 theorems, 108 equations, 7 figures, 1 algorithm)


Introduction
Background
Robust mean estimation
Good-enough ancillary scaling
A bridge between two problems
Overview of contributions and limitations
Theory
Links to the mean-SD objective
Guiding the optimal threshold
Deriving an algorithm for finite samples
Stationary points of mean-variance
Comparison with dual form of DRO risk
Empirical analysis
Methods to be compared
Simulated noisy classification on the plane
...and 9 more sections

Key Result

Proposition 1

Let $\mathcal{H}$ be such that $\mathbf{E}_{\mu}\lvert \mathsf{L}(h) \rvert^{2} < \infty$ for each $h \in \mathcal{H}$. If we set $\alpha = \alpha(\beta)$ such that $\alpha(\beta)/\sqrt{\beta} \to \widetilde{\alpha} \in [0,\infty)$ as $\beta \to$ […] for any choice of threshold $a \in \mathbb{R}$ and weight $\alpha \geq 0$.

Figures (7)

Figure 1: From left to right, we plot the graphs of $\rho(\cdot)$, $\rho^{\prime}(\cdot)$, and $\rho^{\prime\prime}(\cdot)$ with $\rho$ as in (\ref{eqn:sun_huber_fn}). In the middle plot, the dotted curves represent the upper (blue) and lower (dark pink) bounds in (\ref{eqn:motivation_catoni_condition}) with $\gamma=1$.
Figure 2: Graphs of the smooth Huber function, with $\rho$ as in (\ref{eqn:sun_huber_fn}), over a range of smoothing parameters. For visual comparison, the graph of $x \mapsto x^{2}/2$ is plotted with a thick dashed green curve.
Figure 3: Graph of the Legendre transform $\rho^{\ast}$ as given in (\ref{eqn:rho_legendre}) over $(-1,1)$.
Figure 4: 2D classification example from §\ref{sec:empirical_sims}. The red line represents the initial value used by each method.
Figure 5: From each method class, we show the classification error rate and Euclidean norm trajectories corresponding to the setting that achieved the best error rate after the final iteration.
...and 2 more figures

Theorems & Definitions (15)

Proposition 1
Proposition 2
Proposition 3: Concentration at a shifted location
Proposition 4: Joint objective is non-convex and non-smooth
Proposition 5
Lemma 6: Useful inequalities
Lemma 7
Lemma 8: Properties of partial objective
Proof of Lemma \ref{lem:partial_objective}
Proof of Proposition \ref{prop:crit_optimized_scale}
...and 5 more
