Robust Stochastic Optimization via Gradient Quantile Clipping

Ibrahim Merad; Stéphane Gaïffas

TL;DR

A new clipping strategy for Stochastic Gradient Descent (SGD) uses quantiles of the gradient norm as clipping thresholds. It yields a robust and efficient optimization algorithm for smooth objectives that tolerates heavy-tailed samples and a fraction of outliers in the data stream, akin to Huber contamination.

Abstract

We introduce a clipping strategy for Stochastic Gradient Descent (SGD) which uses quantiles of the gradient norm as clipping thresholds. We prove that this new strategy provides a robust and efficient optimization algorithm for smooth objectives (convex or non-convex), that tolerates heavy-tailed samples (including infinite variance) and a fraction of outliers in the data stream akin to Huber contamination. Our mathematical analysis leverages the connection between constant step size SGD and Markov chains and handles the bias introduced by clipping in an original way. For strongly convex objectives, we prove that the iteration converges to a concentrated distribution and derive high probability bounds on the final estimation error. In the non-convex case, we prove that the limit distribution is localized on a neighborhood with low gradient. We propose an implementation of this algorithm using rolling quantiles which leads to a highly efficient optimization procedure with strong robustness properties, as confirmed by our numerical experiments.
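
As a rough illustration of the rolling-quantile clipping described in the abstract, here is a minimal Python sketch applied to a one-dimensional mean estimation problem with outliers. It is not the authors' implementation. The function names (`rolling_quantile_clipped_sgd`, `demo`), the buffer size, the default quantile `p`, the initial threshold `tau0`, the contamination model, and the choice to store raw (unclipped) gradient norms in the buffer are all assumptions made for illustration.

```python
import math
import random
from collections import deque


def rolling_quantile_clipped_sgd(grad_fn, theta0, samples, step=0.05, p=0.2,
                                 buffer_size=50, tau0=1.0):
    """Constant-step SGD where each gradient is clipped at a rolling quantile.

    Hypothetical sketch: the threshold is the empirical p-quantile of the
    most recent gradient norms (assumed buffer policy, not from the paper).
    """
    theta = list(theta0)
    norms = deque(maxlen=buffer_size)  # sliding window of past gradient norms
    for x in samples:
        g = grad_fn(theta, x)
        norm = math.sqrt(sum(gi * gi for gi in g))
        if norms:
            ordered = sorted(norms)
            tau = ordered[min(int(p * len(ordered)), len(ordered) - 1)]
        else:
            tau = tau0  # assumed starting threshold before any norm is seen
        # Rescale the gradient so that its norm never exceeds tau.
        scale = min(1.0, tau / norm) if norm > 0.0 else 1.0
        theta = [ti - step * scale * gi for ti, gi in zip(theta, g)]
        norms.append(norm)
    return theta


def demo(n=5000, eta=0.05, seed=0):
    """Estimate a mean of 1.0 when a fraction eta of samples equals 1000.0."""
    rng = random.Random(seed)
    samples = [[1000.0] if rng.random() < eta else [rng.gauss(1.0, 1.0)]
               for _ in range(n)]
    # Gradient of the objective 0.5 * ||theta - x||^2 with respect to theta.
    grad = lambda theta, x: [ti - xi for ti, xi in zip(theta, x)]
    estimate = rolling_quantile_clipped_sgd(grad, [0.0], samples)[0]
    sample_mean = sum(x[0] for x in samples) / n
    return estimate, sample_mean
```

Calling `demo()` returns the clipped-SGD estimate together with the plain sample mean of the same contaminated stream, so the two can be compared directly. Because the threshold tracks a low quantile of recent gradient norms, an outlier can move the iterate by at most the step size times that threshold, provided the buffer itself is not dominated by outliers.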

Paper Structure (45 sections, 12 theorems, 119 equations, 4 figures, 1 table, 2 algorithms)

This paper contains 45 sections, 12 theorems, 119 equations, 4 figures, 1 table, 2 algorithms.

Introduction
Contributions.
Related works.
Agenda.
Preliminaries
Strongly Convex Objectives
Non-Convex Objectives
Implementation and Numerical Experiments
Linear regression.
Logistic regression.
Classification with shallow networks.
Conclusion
Additional experimental results
Classification with shallow networks.
Expectation estimation.
...and 30 more sections

Key Result

Theorem 1

Let Assumptions asm:lipsmooth to asm:grad_moments hold and assume there is a quantile $p \in [\eta, 1-\eta]$ such that [...]. Then, for a step size $\beta$ satisfying [...], the Markov chain $(\theta_t)_{t\geq 0}$ generated by QC-SGD with parameters $\beta$ and $p$ converges geometrically to a unique invariant measure $\pi_{\beta, p}$: for any initial $\theta_0 \in \mathbb{R}^d$, there are $\rho < 1$ and $M < \infty$ [...]

Figures (4)

Figure 1: Evolution of $\|\theta_t-\theta^\star\|$ on the tasks of linear regression (top row) and logistic regression (bottom row) averaged over 100 runs at increasing corruption levels (error bars represent half the standard deviation). Estimators based on Huber's loss are strongly affected by data corruption. SGD with constant clipping thresholds is robust but slow to converge for linear regression and requires tuning for better final precision. RQC-SGD combines fast convergence with good final precision thanks to its adaptive clipping strategy.
Figure 2: Evolution of the test loss ($y$-axis) against iteration $t$ ($x$-axis) for the training of a single hidden layer network on different real world classification datasets (average over 20 runs). We observe more consistent and stable objective decrease for RQC-SGD whereas constant clipping baselines are slower and may fail to converge.
Figure 3: Evolution of the test loss ($y$-axis) against iteration $t$ ($x$-axis) for the training of a single hidden layer network on additional real world classification datasets (average over 20 runs).
Figure 4: Evolution of $\|\theta_t-\theta^\star\|$ ($y$-axis) against iteration $t$ ($x$-axis) for the expectation estimation task, averaged over 100 runs at different corruption levels $\eta$ (band widths correspond to the standard deviation over the 100 runs). For $\eta = 0.04$, the evolution on a single run is also displayed. RQC-SGD keeps performing well as $\eta$ increases, while CMOM and GMOM are more sensitive.

Theorems & Definitions (22)

Definition 1
Theorem 1: Geometric ergodicity
Proposition 1
Proposition 2
Corollary 1
Corollary 2
Theorem 2: Ergodicity
Proposition 3
Lemma 1
proof
...and 12 more
