
High-Probability Convergence for Composite and Distributed Stochastic Minimization and Variational Inequalities with Heavy-Tailed Noise

Eduard Gorbunov, Abdurakhmon Sadiev, Marina Danilova, Samuel Horváth, Gauthier Gidel, Pavel Dvurechensky, Alexander Gasnikov, Peter Richtárik

TL;DR

The paper develops high-probability convergence theory for stochastic composite and distributed optimization and variational inequalities under heavy-tailed noise, highlighting that naive gradient clipping can break fixed-point convergence. It proposes gradient-difference clipping via Prox-clipped-SGD-shift and its distributed variants (DProx-clipped-SGD-shift, DProx-clipped-SSTM-shift), achieving high-probability rates with near-optimal dependence on the accuracy $\varepsilon$ and confidence level $\beta$, including linear speedups in the number of workers. The results extend to variational inequalities with distributed clipped SGDA/SEG variants and provide acceleration, tight rate comparisons, and parameter-dependent bounds under bounded $\alpha$-th moment noise assumptions. Overall, the work advances robust, distributed optimization under heavy-tailed noise, offering practical, provably efficient algorithms for composite minimization and VIPs with strong theoretical guarantees.
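
To see why naive clipping can break the fixed point, a small worked example may help. The notation below ($f$, $\Psi$, $\gamma$, $\lambda$, and the toy problem itself) is an assumption made for illustration and is not taken from the paper. Consider a composite problem $\min_x f(x) + \Psi(x)$ and the usual norm-clipping operator:

$$
\operatorname{clip}(g, \lambda) = \min\left\{1, \frac{\lambda}{\|g\|}\right\} g,
\qquad
x^{k+1} = \operatorname{prox}_{\gamma \Psi}\bigl(x^k - \gamma \operatorname{clip}(\nabla f(x^k), \lambda)\bigr).
$$

At a solution $x^*$ the optimality condition is $0 \in \nabla f(x^*) + \partial \Psi(x^*)$, so $\nabla f(x^*)$ is in general nonzero. Take $f(x) = \tfrac{1}{2}(x - 3)^2$ and $\Psi(x) = |x|$ in one dimension, so that $x^* = 2$ and $\nabla f(x^*) = -1$. With $\gamma = 0.5$ and $\lambda = 0.5 < |\nabla f(x^*)|$, one step started at $x^*$ gives

$$
\operatorname{prox}_{\gamma \Psi}\bigl(2 - 0.5 \cdot (-0.5)\bigr)
= \operatorname{sign}(2.25)\,\max\{|2.25| - 0.5,\ 0\}
= 1.75 \neq x^*,
$$

so $x^*$ is not a fixed point of the clipped iteration even though there is no noise at all. Without clipping, the same step returns $\operatorname{prox}_{\gamma \Psi}(2 + 0.5) = 2$.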

Abstract

High-probability analysis of stochastic first-order optimization methods under mild assumptions on the noise has been gaining a lot of attention in recent years. Typically, gradient clipping is one of the key algorithmic ingredients to derive good high-probability guarantees when the noise is heavy-tailed. However, if implemented naïvely, clipping can spoil the convergence of the popular methods for composite and distributed optimization (Prox-SGD/Parallel SGD) even in the absence of any noise. Due to this reason, many works on high-probability analysis consider only unconstrained non-distributed problems, and the existing results for composite/distributed problems do not include some important special cases (like strongly convex problems) and are not optimal. To address this issue, we propose new stochastic methods for composite and distributed optimization based on the clipping of stochastic gradient differences and prove tight high-probability convergence results (including nearly optimal ones) for the new methods. Using similar ideas, we also develop new methods for composite and distributed variational inequalities and analyze the high-probability convergence of these methods.
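
The contrast between clipping the gradient and clipping a gradient difference can be sketched numerically. The block below is a minimal, noise-free illustration under stated assumptions, not the authors' implementation: the toy problem ($f(x) = \tfrac{1}{2}\|x - a\|^2$ with an $\ell_1$ regularizer), the function names, the shift variable `h`, the parameter `nu`, and all numeric values are hypothetical choices, and the shifted update is one reading of "clipping of stochastic gradient differences".

```python
import numpy as np


def clip(v, lam):
    """Norm clipping: rescale v so that its Euclidean norm is at most lam."""
    norm = np.linalg.norm(v)
    return v if norm <= lam else v * (lam / norm)


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (assumed regularizer for this sketch)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def prox_clipped_gd(grad, x0, gamma, lam, reg, steps):
    """Naive variant: clip the gradient itself, then take a proximal step."""
    x = x0.copy()
    for _ in range(steps):
        x = soft_threshold(x - gamma * clip(grad(x), lam), gamma * reg)
    return x


def prox_clipped_gd_shift(grad, x0, gamma, lam, reg, nu, steps):
    """Shifted variant (assumed form): clip the difference between the
    gradient and a learned shift h, and move h toward the gradient."""
    x = x0.copy()
    h = np.zeros_like(x0)
    for _ in range(steps):
        g_hat = clip(grad(x) - h, lam)      # clipped gradient difference
        x = soft_threshold(x - gamma * (h + g_hat), gamma * reg)
        h = h + nu * g_hat                  # shift tracks the gradient
    return x


# Hypothetical toy problem: f(x) = 0.5 * ||x - a||^2, Psi(x) = ||x||_1.
# Its minimizer is soft_threshold(a, 1) = [2.0], where grad f = [-1.0].
a = np.array([3.0])
grad = lambda x: x - a
x0 = np.zeros(1)

# The clipping level 0.5 is below |grad f(x*)| = 1, the problematic regime.
x_naive = prox_clipped_gd(grad, x0, gamma=0.5, lam=0.5, reg=1.0, steps=200)
x_shift = prox_clipped_gd_shift(grad, x0, gamma=0.5, lam=0.5, reg=1.0,
                                nu=1.0, steps=200)
print("naive:", x_naive, "shift:", x_shift)
```

In this toy run the naive iterate never leaves the origin, because the clipped gradient is too small to overcome the shrinkage of the proximal step, while the shifted iterate approaches the minimizer at 2. The clipped quantity in the shifted variant tends to zero near the solution, so clipping eventually becomes inactive there.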

Paper Structure

This paper contains 99 sections, 26 theorems, 647 equations, 1 figure, 2 tables, 1 algorithm.

Key Result

Theorem 2.1

Let $n=1$ and let the bounded $\alpha$-th moment, $L$-smoothness, and quasi-strong convexity (with $\mu > 0$) assumptions hold for $Q = B_{2R}(x^*)$, $R \geq \|x^0 - x^*\|$, for some minimizer $x^*$. (In all of our results, one can use any solution $x^*$; e.g., one can take $x^*$ to be the projection of $x^0$ onto the solution set.) Then, to guarantee $\|x^K - x^*\|^2 \leq \varepsilon$ with probability $\geq 1 - \beta$, Prox-clipped-...

Figures (1)

  • Figure 1: Performance comparison of Prox-clipped-SGD, Prox-clipped-SGD-star, and Prox-clipped-SGD-shift on the experimental problem (eq:experiment_problem), each run with a fixed clipping level $\lambda \in \{0.1, 0.01, 0.001\}$.

Theorems & Definitions (49)

  • Theorem 2.1
  • Proof: Sketch of the proof
  • Remark 2.2: On the logarithmic factors.
  • Remark 2.3: Dependence of the parameters on $R$.
  • Theorem 2.4: Convergence of DProx-clipped-SGD-shift: quasi-strongly convex case
  • Proof: Sketch of the proof
  • Theorem 2.5: Convergence of DProx-clipped-SGD-shift: convex case
  • Remark 2.6: Dependence of the parameters on $\zeta_*$.
  • Theorem 2.7: Convergence of DProx-clipped-SSTM-shift
  • Proof: Sketch of the proof
  • ...and 39 more