SOREL: A Stochastic Algorithm for Spectral Risks Minimization

Yuze Ge; Rujun Jiang

TL;DR

This work addresses spectral risk minimization, which weights sample losses non-uniformly to emphasize tail risk, a setting that standard empirical risk minimization (with equal weights) does not cover. The authors develop SOREL, a stochastic gradient-based method built on a minimax reformulation and a proximal primal-dual framework, with trajectory stabilization to ensure convergence to the minimizer of the original (unsmoothed) spectral risk. They prove a near-optimal rate of $\widetilde{O}(1/\sqrt{\epsilon})$ under strongly convex regularization and report strong empirical performance in both runtime and sample complexity across several spectral risks (CVaR, ESRM, Extremile) on real datasets. Compared with the baselines SGD, LSVRG, and Prospect, SOREL converges closer to the true optimum in most cases, and mini-batching further accelerates it, which makes it practical for scalable, risk-aware learning in finance and ML fairness.
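The minimax reformulation mentioned above can be illustrated numerically. The sketch below is not the paper's code: it assumes the standard definition of a spectral risk as a weighted sum of sorted losses with non-decreasing weights `sigma`, and it assumes the feasible set of the max player is the convex hull of all permutations of `sigma`, so that a linear objective attains its maximum at one of those permutations. The function names and the toy numbers are hypothetical.

```python
import itertools
import numpy as np

def spectral_risk(losses, sigma):
    """Weighted sum of sorted losses: the smallest loss is paired with
    sigma[0] and the largest with sigma[-1] (sigma assumed non-decreasing)."""
    losses = np.asarray(losses, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return float(np.dot(sigma, np.sort(losses)))

def max_over_permutations(losses, sigma):
    """Brute-force maximum of <lam, losses> over every permutation lam of sigma.
    Only feasible for tiny n; used here purely as a sanity check."""
    losses = np.asarray(losses, dtype=float)
    return max(float(np.dot(perm, losses))
               for perm in itertools.permutations(sigma))

# Toy per-sample losses and non-decreasing weights that sum to one (illustrative).
losses = [0.9, 0.1, 2.5, 0.4]
sigma = [0.1, 0.2, 0.3, 0.4]

direct = spectral_risk(losses, sigma)
minimax = max_over_permutations(losses, sigma)
print(direct, minimax)  # the two values agree up to floating-point error
```

The agreement follows from the rearrangement inequality: pairing the largest weight with the largest loss maximizes the inner product. This is what lets the sorted, non-smooth objective be written as a maximum of linear functions of the losses.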

Abstract

The spectral risk has wide applications in machine learning, especially in real-world decision-making, where people are not only concerned with models' average performance. By assigning different weights to the losses of different sample points, rather than the same weights as in the empirical risk, it allows the model's performance to lie between the average performance and the worst-case performance. In this paper, we propose SOREL, the first stochastic gradient-based algorithm with convergence guarantees for spectral risk minimization. Previous algorithms often add a strongly concave function to smooth the spectral risk and therefore lack convergence guarantees for the original spectral risk. We theoretically prove that our algorithm achieves a near-optimal rate of $\widetilde{O}(1/\sqrt{\epsilon})$ in terms of $\epsilon$. Experiments on real datasets show that our algorithm outperforms existing algorithms in most cases, both in terms of runtime and sample complexity.
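To make the "between average and worst case" statement concrete, here is a minimal sketch using a simplified CVaR-style weighting. It assumes the tail fraction corresponds to an integer number `k` of samples, so the weights are uniform over the `k` largest losses; the paper's exact weight definitions for CVaR, ESRM, and Extremile are not reproduced here, and the helper names are hypothetical.

```python
import numpy as np

def cvar_weights(n, k):
    """Weight 1/k on the k largest of n sorted losses and zero elsewhere
    (simplified CVaR weights, assuming the tail covers exactly k samples)."""
    sigma = np.zeros(n)
    sigma[n - k:] = 1.0 / k
    return sigma

def spectral_risk(losses, sigma):
    """Weighted sum of the losses sorted in ascending order."""
    return float(np.dot(sigma, np.sort(np.asarray(losses, dtype=float))))

losses = np.array([0.9, 0.1, 2.5, 0.4])  # toy per-sample losses (illustrative)
n = len(losses)

average = spectral_risk(losses, cvar_weights(n, n))  # k = n: plain empirical risk
tail = spectral_risk(losses, cvar_weights(n, 2))     # k = 2: mean of the two worst losses
worst = spectral_risk(losses, cvar_weights(n, 1))    # k = 1: maximum loss
print(average, tail, worst)
```

Shrinking `k` from `n` down to 1 moves the objective monotonically from the empirical mean to the maximum loss, which is the interpolation the abstract describes.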

Paper Structure (23 sections, 10 theorems, 45 equations, 3 figures, 2 tables, 2 algorithms)

This paper contains 23 sections, 10 theorems, 45 equations, 3 figures, 2 tables, 2 algorithms.

Introduction
Our Contributions.
Related work
Statistical Properties of the Spectral Risk.
Applications.
Existing Optimization Methods.
Algorithm
Challenges of Stochastic Optimization for Spectral Risks
Biases of Stochastic Subgradient Estimators.
Stabilizing the Optimization Trajectory.
Stochastic Optimization for the Primal Variable.
The SOREL Algorithm
Theoretical Analysis
Experiments
Results.
...and 8 more sections

Key Result

Lemma 1

Suppose Assumption \ref{assumption:basic} holds. Let $\{\boldsymbol{w}_k\}$ and $\{\boldsymbol{\lambda}_k\}$ be the sequences generated by Algorithm \ref{alg:main_alg}. Then for any $\boldsymbol{w}\in\mathbb{R}^d$ and $\boldsymbol{\lambda}\in \Pi_{\boldsymbol{\sigma}}$, the following inequality holds,

Figures (3)

Figure 1: Level-set plot of 2D least-squares regression with the primal-dual optimization trajectories described in Section \ref{sec:alg-prob}. Left: the max subproblem has no proximal term. Right: the max subproblem has a proximal term. In both panels the min subproblem has no proximal term. The black star marks the optimal point. The sample points are obtained by projecting the yacht dataset onto $\mathbb{R}^2$ using PCA.
Figure 2: Suboptimality of spectral risks for different algorithms without mini-batching. The $x$-axis shows the effective number of samples used by the algorithm divided by $n$ (odd columns) or CPU time (even columns). Each row corresponds to one dataset, and each column to one type of spectral risk.
Figure 3: Suboptimality of spectral risks for different algorithms with mini-batching. The $x$-axis shows the effective number of samples used by the algorithm divided by $n$ (odd columns) or CPU time (even columns).

Theorems & Definitions (19)

Lemma 1
Theorem 1
Corollary 1
Remark 1
Lemma 2
proof
Lemma 3
proof
Lemma 4
proof
...and 9 more
