Convergence of Sign-based Random Reshuffling Algorithms for Nonconvex Optimization

Zhen Qin; Zhishuai Liu; Pan Xu

TL;DR

This work provides the first convergence analysis of sign-based optimization under random reshuffling for nonconvex finite-sum problems, revealing a potentially non-vanishing error term $\sigma$ that arises from the combination of sign updates and without-replacement sampling. To overcome this, the authors introduce SignRVR with variance reduction and SignRVM with momentum, proving faster convergence rates in the centralized setting and extending them to distributed environments with data partitioning and majority voting. The results establish provable guarantees for practical, sign-based optimization methods that use RR, both centrally and across machines, and they are backed by experiments on Rosenbrock functions and MNIST showing competitive or superior performance to strong baselines. Overall, the paper bridges theory and practice for sign-based optimization in nonconvex finite-sum problems, offering new tools for efficient, communication-friendly learning in distributed settings.
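The majority-voting aggregation mentioned above can be sketched as follows. This is a minimal illustration of the standard sign-based majority vote, not the paper's dist-SignRVR or dist-SignRVM, which additionally use variance reduction or momentum. The function name `majority_vote` and the toy gradients are assumptions made for illustration.

```python
import numpy as np

def majority_vote(worker_grads):
    """Aggregate per-worker gradients by an elementwise majority vote on signs.

    worker_grads: array of shape (M, d), one row per worker (hypothetical layout).
    """
    signs = np.sign(worker_grads)       # each worker transmits only coordinate signs
    return np.sign(signs.sum(axis=0))   # majority per coordinate; a tie yields 0

# Toy example with M = 3 workers and d = 3 coordinates.
grads = np.array([[0.3, -1.2, 0.5],
                  [-0.1, -0.4, 2.0],
                  [0.7, 0.9, -0.3]])
vote = majority_vote(grads)
```

With an odd number of workers and nonzero gradient entries, no ties occur, so every coordinate of `vote` is either +1 or -1.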

Abstract

signSGD is popular in nonconvex optimization due to its communication efficiency. Yet, existing analyses typically assume data are sampled with replacement in each iteration, contradicting a common practical implementation where data are randomly reshuffled and sequentially fed into the algorithm. This gap leaves the theoretical understanding of the more practical algorithm, signSGD with random reshuffling (SignRR), largely unexplored. We develop the first analysis of SignRR to identify the core technical challenge that prevents a thorough convergence analysis of this method. In particular, given a dataset of size $n$ and $T$ epochs, we show that the expected gradient norm of SignRR is upper bounded by $O(\log(nT)/\sqrt{nT} + \sigma)$, where $\sigma$ is the averaged conditional mean square error that may not vanish. To tackle this limitation, we develop two new sign-based algorithms under random reshuffling: SignRVR, which incorporates variance-reduced gradients, and SignRVM, which integrates momentum-based updates. Both algorithms achieve a faster convergence rate of $O(\log(nT)/\sqrt{nT} + \log(nT)\sqrt{n}/\sqrt{T})$. We further extend our algorithms to a distributed setting, with a convergence rate of $O(\log(n_0T)/\sqrt{n_0T} + \log(n_0T)\sqrt{n_0}/\sqrt{T})$, where $n_0$ is the size of the dataset of a single machine. These results mark the first step towards the theoretical understanding of practical implementation of sign-based optimization algorithms. Finally, we back up our theoretical findings through experiments on simulated and real-world problems, verifying that randomly reshuffled sign methods match or surpass existing baselines.
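A minimal sketch of SignRR as described in the abstract: each epoch draws a fresh permutation of the $n$ component functions and applies a sign-only update per component. The callback `grad_i`, the toy quadratic objective, and the default `gamma0` are assumptions for illustration, and the zero-indexed diminishing stepsize $\gamma_0/\sqrt{nt+i+1}$ is one possible choice rather than a prescribed one.

```python
import numpy as np

def sign_rr(grad_i, x0, n, epochs, gamma0=0.1, seed=0):
    """Sketch of signSGD with random reshuffling (SignRR).

    grad_i(x, j) is a hypothetical callback returning the gradient of the
    j-th component function at x.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for t in range(epochs):
        perm = rng.permutation(n)  # reshuffle once per epoch (without replacement)
        for i, j in enumerate(perm):
            gamma = gamma0 / np.sqrt(n * t + i + 1)  # diminishing stepsize
            x -= gamma * np.sign(grad_i(x, j))       # update uses only gradient signs
    return x

# Toy finite sum (assumed): f(x) = (1/n) * sum_j 0.5 * ||x - a_j||^2.
a = np.array([[1.0, -2.0], [3.0, 0.0], [-1.0, 2.0], [1.0, 4.0]])
x_final = sign_rr(lambda x, j: x - a[j], x0=[5.0, 5.0], n=len(a), epochs=200)
```

Because each step uses the sign of a single component gradient rather than of the full gradient, the iterates need not settle at the minimizer of the average, which gives some intuition for the error term $\sigma$ in the bound.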

Paper Structure (34 sections, 14 theorems, 97 equations, 5 figures, 3 algorithms)

Introduction
Notation
Most Related Work
Sign-based SGD
Random Reshuffling
Identifying Theoretical Challenges of SignRR
Algorithm Description of SignRR
Problem Formulation and Preliminary Analysis of SignRR
Reducing the Variance of SignRR
SignRVR with Momentum Updates
Distributed Sign-based Random Reshuffling Algorithms
SignRVR in the Distributed Setting
Experiments
Minimizing the Finite-Sum of Rosenbrock Functions
Centralized Setting
...and 19 more sections

Key Result

Proposition 3.3

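Under Assumptions Assp:LB and Assp:smooth, if the stepsize is set to $\gamma_t^i = \frac{\gamma_0}{\sqrt{nt+i+1}}$, where $\gamma_0>0$ is a universal constant, then the iterates of SignRR (alg:sign_RR) satisfy a bound of the form summarized in the abstract: the expected gradient norm is $O(\log(nT)/\sqrt{nT} + \sigma)$, where $\sigma$ is the averaged conditional mean square error.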
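The stepsize schedule in Proposition 3.3 depends only on the global iteration count $nt+i+1$. A small sketch, assuming the epoch index $t$ and the inner index $i$ both start at 0 (the indexing convention is not stated here) and using a hypothetical helper named `stepsize`:

```python
import math

def stepsize(t, i, n, gamma0=1.0):
    """gamma_t^i = gamma0 / sqrt(n*t + i + 1), with t and i assumed zero-indexed."""
    return gamma0 / math.sqrt(n * t + i + 1)

# Two epochs over n = 4 components: stepsizes 1/sqrt(1), 1/sqrt(2), ..., 1/sqrt(8).
n = 4
schedule = [stepsize(t, i, n) for t in range(2) for i in range(n)]
```

Under this convention the schedule decays continuously across epoch boundaries rather than resetting at the start of each epoch.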

Figures (5)

Figure 1: Averaged results over 5 independent repetitions in the centralized setting with different data variances. (a)-(b): SignRR, SignRVR and SignRVM with a constant learning rate. (c)-(d): SignRR, SignRVR and SignRVM with a diminishing learning rate. SignRVR and SignRVM achieve the best results in all settings.
Figure 2: Averaged results over 5 independent repetitions in the distributed setting with different numbers of workers. dist-SignRVR and dist-SignRVM match SSDM's performance and outperform others as $M$ increases.
Figure 3: Test accuracy on MNIST in the centralized setting with different batch sizes. Results are averaged over 5 independent repetitions.
Figure 4: Test accuracy on MNIST in the distributed setting with different numbers of workers and a fixed batch size of 64. Results are averaged over 5 independent repetitions.
Figure 5: Ablation study on the momentum hyperparameter of SignRVM and dist-SignRVM.

Theorems & Definitions (20)

Proposition 3.3
Remark 3.4
Theorem 4.1
Corollary 4.2
Remark 4.3
Remark 4.4
Remark 4.5
Theorem 5.1
Corollary 5.2
Theorem 6.2
...and 10 more
