Convergence of Decentralized Stochastic Subgradient-based Methods for Nonsmooth Nonconvex functions

Siyuan Zhang; Nachuan Xiao; Xin Liu

TL;DR

The paper introduces a unified decentralized stochastic subgradient framework for nonsmooth nonconvex optimization without Clarke regularity. By relating the discrete updates to a differential inclusion (DI) that admits a coercive Lyapunov function, it proves consensus and convergence to the DI's stable set under diminishing step-sizes, for both random reshuffling and with-replacement sampling. It shows that DSGD, DSGD-M, DSGD-T, and DSignSGD all fit into the framework and attain global or high-probability convergence to critical points defined through conservative fields, which the authors present as the first convergence guarantees in this setting. Preliminary experiments on nonsmooth neural networks demonstrate efficiency and robustness, with DSignSGD offering competitive performance. The work paves the way for further analysis of rates, time-varying networks, asynchronous updates, and communication compression in nonsmooth decentralized optimization.

Abstract

In this paper, we focus on decentralized stochastic subgradient-based methods for minimizing nonsmooth nonconvex functions without Clarke regularity, especially in the decentralized training of nonsmooth neural networks. We propose a general framework that unifies various decentralized subgradient-based methods, such as decentralized stochastic subgradient descent (DSGD), DSGD with gradient-tracking technique (DSGD-T), and DSGD with momentum (DSGD-M). To establish the convergence properties of our proposed framework, we relate the discrete iterates to the trajectories of a continuous-time differential inclusion, which is assumed to have a coercive Lyapunov function with a stable set $\mathcal{A}$. We prove the asymptotic convergence of the iterates to the stable set $\mathcal{A}$ with sufficiently small and diminishing step-sizes. These results provide the first convergence guarantees for some well-recognized decentralized stochastic subgradient-based methods without Clarke regularity of the objective function. Preliminary numerical experiments demonstrate that our proposed framework yields highly efficient decentralized stochastic subgradient-based methods with convergence guarantees in the training of nonsmooth neural networks.
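To make the setting concrete, here is a minimal sketch of a DSGD-style iteration in its commonly used form, $x_{i,k+1} = \sum_j W_{ij} x_{j,k} - \eta_k g_{i,k}$. Everything specific in it is an assumption for illustration and is not taken from the paper: the ring mixing matrix, the toy scalar loss $|\max(x,0) - a|$ (a one-parameter ReLU fit, nonsmooth and nonconvex), the step-size schedule $\eta_k = 0.5/(k+1)$, the with-replacement sampling via a seeded generator, and the function names `ring_mixing_matrix`, `subgrad`, and `dsgd`.

```python
import random


def ring_mixing_matrix(d):
    """Symmetric doubly stochastic matrix for a ring of d >= 3 agents (assumed topology)."""
    W = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in (i - 1, i, i + 1):
            W[i][j % d] = 1.0 / 3.0
    return W


def subgrad(x, a):
    """One chain-rule element for the toy loss f(x) = |max(x, 0) - a|."""
    relu = max(x, 0.0)
    outer = (relu > a) - (relu < a)  # sign of (relu - a), taken as 0 at the kink
    inner = 1.0 if x > 0 else 0.0    # selection from the ReLU subdifferential
    return outer * inner


def dsgd(local_data, W, x0, steps, seed=0):
    """Mix with neighbours, then take a local stochastic subgradient step."""
    rng = random.Random(seed)
    d = len(local_data)
    x = list(x0)
    for k in range(steps):
        eta = 0.5 / (k + 1)  # diminishing step-size (illustrative schedule)
        # Each agent samples one local data point with replacement.
        g = [subgrad(x[i], rng.choice(local_data[i])) for i in range(d)]
        mixed = [sum(W[i][j] * x[j] for j in range(d)) for i in range(d)]
        x = [mixed[i] - eta * g[i] for i in range(d)]
    return x


# Hypothetical local datasets, one list per agent.
local_data = [[0.5, 1.0], [1.5, 2.0], [1.0, 3.0], [2.0, 2.5], [0.5, 1.5]]
W = ring_mixing_matrix(5)
x_final = dsgd(local_data, W, x0=[5.0, -1.0, 2.0, 0.0, 3.0], steps=5000)
spread = max(x_final) - min(x_final)  # disagreement between agents
print(x_final, spread)
```

In this toy run the quantity to watch is `spread`: because the subgradient elements are bounded and the step-sizes diminish, repeated mixing pulls the agents' iterates together, which is the consensus behaviour the abstract refers to. The sketch says nothing about which point the agents agree on.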

Paper Structure (29 sections, 21 theorems, 111 equations, 9 figures, 1 table, 4 algorithms)

Introduction
Existing works on decentralized stochastic optimization
Existing works on stochastic subgradient-based methods for nonsmooth nonconvex optimization
A general framework for decentralized stochastic subgradient-based methods
Contributions
Organizations
Preliminaries
Notations
Mixing matrix
Nonsmooth analysis and conservative field
Differential inclusion and stochastic subgradient methods
A General Framework for Decentralized Stochastic Subgradient-based Methods
Basic assumptions and main results
Convergence with random reshuffling
Convergence under with-replacement sampling
...and 14 more sections

Key Result

Corollary 2.2

For any mixing matrix ${\bm W} \in \mathbb{R}^{d\times d}$ that corresponds to a connected graph $\mathtt{G}$, all the eigenvalues of ${\bm W}$ lie in $(-1,1]$, and ${\bm W}$ has a single eigenvalue at $1$ that admits ${\bm 1_{d}}$ as its eigenvector.
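The corollary can be checked by hand on a small example. The sketch below assumes a ring of $d = 5$ agents with equal weights $1/3$ on each agent and its two neighbours, which gives a symmetric doubly stochastic matrix; this particular choice and the helper names are illustrative and not taken from the paper. For such a circulant matrix the eigenvalues are $\lambda_k = \tfrac{1}{3} + \tfrac{2}{3}\cos(2\pi k/d)$ with eigenvectors $v_j = \cos(2\pi k j/d)$, and the code verifies this numerically instead of relying on a linear algebra library.

```python
import math


def ring_mixing_matrix(d):
    """Symmetric doubly stochastic matrix for a ring of d >= 3 agents (assumed topology)."""
    W = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in (i - 1, i, i + 1):
            W[i][j % d] = 1.0 / 3.0
    return W


def matvec(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def ring_eigenvalue(d, k):
    """Closed-form eigenvalue of the circulant ring matrix above."""
    return 1.0 / 3.0 + (2.0 / 3.0) * math.cos(2.0 * math.pi * k / d)


def eig_residual(W, k):
    """Max entry of |W v - lambda_k v| for the cosine eigenvector v."""
    d = len(W)
    v = [math.cos(2.0 * math.pi * k * j / d) for j in range(d)]
    lam = ring_eigenvalue(d, k)
    return max(abs(a - lam * b) for a, b in zip(matvec(W, v), v))


d = 5
W = ring_mixing_matrix(d)
eigs = [ring_eigenvalue(d, k) for k in range(d)]
residuals = [eig_residual(W, k) for k in range(d)]
W_ones = matvec(W, [1.0] * d)  # should reproduce the all-ones vector

# Consequence of the spectrum: repeated mixing drives any vector to its average.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
avg = sum(x) / d
for _ in range(200):
    x = matvec(W, x)

print(eigs)
print(max(abs(xi - avg) for xi in x))
```

Only $k = 0$ yields the eigenvalue $1$, with the all-ones eigenvector; every other eigenvalue lies strictly inside $(-1, 1)$, which is why the repeated products $W^k x$ settle at the average of the initial entries.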

Figures (9)

Figure 1: Numerical performance comparison of DSGD, DSGD-M, and DSignSGD in training ResNet50 on the CIFAR-100 dataset using the random reshuffling strategy.
Figure 2: Numerical performance comparison of DSGD, DSGD-M, and DSignSGD in training ResNet50 on the CIFAR-10 dataset using the random reshuffling strategy.
Figure 3: Numerical performance comparison of DSGD, DSGD-M, and DSignSGD in training ResNet50 on the CIFAR-100 dataset using the with-replacement sampling strategy.
Figure 4: Numerical performance comparison of DSGD, DSGD-M, and DSignSGD in training ResNet50 on the CIFAR-10 dataset using the with-replacement sampling strategy.
Figure 5: Numerical performance comparison of DSGD and DSGD-T in training ResNet50 on the CIFAR-100 dataset using the random reshuffling strategy.
...and 4 more figures

Theorems & Definitions (47)

Definition 2.1
Corollary 2.2
Definition 2.3
Definition 2.4
Definition 2.5: Conservative field
Definition 2.6
Proposition 2.7: Theorem 5 in bolte2021conservative
Definition 2.8
Definition 2.9
Definition 2.10: Lyapunov function
...and 37 more
