Efficient Sign-Based Optimization: Accelerating Convergence via Variance Reduction

Wei Jiang, Sifan Yang, Wenhao Yang, Lijun Zhang

TL;DR

This work studies non-convex stochastic optimization with sign-based updates and introduces Sign-based Stochastic Variance Reduction (SSVR), which takes the sign of a variance-reduced gradient estimator rather than of a raw stochastic gradient. SSVR achieves a convergence rate of $\mathcal{O}(d^{1/2}T^{-1/3})$ for general non-convex objectives, improving on the $\mathcal{O}(d^{1/2}T^{-1/4})$ rate of signSGD, and $\mathcal{O}(m^{1/4}d^{1/2}T^{-1/2})$ for finite-sum problems. In distributed settings with heterogeneous data, the authors extend the method to SSVR-MV, which aggregates signs by majority vote, and obtain rates of $\mathcal{O}(d^{1/2}T^{-1/2} + dn^{-1/2})$ and $\mathcal{O}(d^{1/4}T^{-1/4})$. Experiments on CIFAR-10/100 are consistent with the theory, showing faster convergence and higher accuracy with 1-bit communication.
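
To make the rates concrete, one can ask how many iterations $T$ are needed to push each bound below a target accuracy. The worked step below is a sketch that ignores the constants hidden in $\mathcal{O}(\cdot)$; the symbol $\epsilon$ is introduced here for illustration and does not appear in the summary.

$$
d^{1/2}T^{-1/4} \le \epsilon \iff T \ge d^{2}\epsilon^{-4}, \qquad
d^{1/2}T^{-1/3} \le \epsilon \iff T \ge d^{3/2}\epsilon^{-3}, \qquad
m^{1/4}d^{1/2}T^{-1/2} \le \epsilon \iff T \ge m^{1/2}d\,\epsilon^{-2}.
$$

For example, with $d = 100$ and $\epsilon = 0.1$, the signSGD bound needs on the order of $10^{8}$ iterations, while the SSVR bound needs on the order of $10^{6}$.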

Abstract

Sign stochastic gradient descent (signSGD) is a communication-efficient method that transmits only the sign of stochastic gradients for parameter updating. Existing literature has demonstrated that signSGD can achieve a convergence rate of $\mathcal{O}(d^{1/2}T^{-1/4})$, where $d$ represents the dimension and $T$ is the iteration number. In this paper, we improve this convergence rate to $\mathcal{O}(d^{1/2}T^{-1/3})$ by introducing the Sign-based Stochastic Variance Reduction (SSVR) method, which employs variance reduction estimators to track gradients and leverages their signs to update. For finite-sum problems, our method can be further enhanced to achieve a convergence rate of $\mathcal{O}(m^{1/4}d^{1/2}T^{-1/2})$, where $m$ denotes the number of component functions. Furthermore, we investigate the heterogeneous majority vote in distributed settings and introduce two novel algorithms that attain improved convergence rates of $\mathcal{O}(d^{1/2}T^{-1/2} + dn^{-1/2})$ and $\mathcal{O}(d^{1/4}T^{-1/4})$ respectively, outperforming the previous results of $\mathcal{O}(dT^{-1/4} + dn^{-1/2})$ and $\mathcal{O}(d^{3/8}T^{-1/8})$, where $n$ represents the number of nodes. Numerical experiments across different tasks validate the effectiveness of our proposed methods.
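
The majority-vote idea mentioned in the abstract can be sketched as follows. This is a minimal illustration of plain sign aggregation, assuming each node holds a local gradient estimate and transmits only its coordinate-wise sign; the function names are hypothetical, and the paper's heterogeneous variants may form or aggregate the signs differently.

```python
import numpy as np

def majority_vote(local_estimates):
    """Aggregate per-node gradient estimates using only their signs.

    local_estimates: array of shape (n_nodes, d), one row per node.
    Returns a vector in {-1, 0, +1}^d (0 only on exact ties or zeros).
    """
    signs = np.sign(local_estimates)   # what each node would transmit (1 bit per coordinate)
    return np.sign(signs.sum(axis=0))  # server keeps the majority sign per coordinate

def majority_vote_step(x, local_estimates, eta):
    """One parameter update using the aggregated sign (illustrative only)."""
    return x - eta * majority_vote(local_estimates)

# Three nodes, two coordinates: node 3 disagrees on the first coordinate.
estimates = np.array([[1.0, -2.0],
                      [3.0, -1.0],
                      [-0.5, -4.0]])
vote = majority_vote(estimates)
```

Because every node sends a sign vector and the server replies with one, communication cost per round is independent of the magnitude of the gradients.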

Paper Structure (24 sections, 14 theorems, 113 equations, 5 figures, 2 tables, 3 algorithms)

Key Result

Theorem 1

Under Assumptions 2 and 3, by setting $\beta = \mathcal{O}(T^{-2/3})$, $\eta = \mathcal{O}(d^{-1/2}T^{-2/3})$, $B_0 = \mathcal{O}(T^{1/3})$, and $B_1 = \mathcal{O}(1)$, our SSVR method ensures a convergence rate of $\mathcal{O}(d^{1/2}T^{-1/3})$.
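
The parameter choices in Theorem 1 can be illustrated with a small sketch. It is not the authors' implementation: it assumes a STORM-style recursive estimator for the variance-reduction step, reads $B_0$ as the initial batch size and $B_1$ as the per-iteration batch size, sets every hidden constant to 1, and uses a hypothetical toy objective $f(x) = \tfrac{1}{2}\|x\|^2$ with multiplicative gradient noise.

```python
import numpy as np

def stoch_grad(x, xi):
    # Hypothetical toy oracle: gradient of 0.5 * ||x||^2 scaled by (1 + noise),
    # so its expectation is the true gradient x.
    return (1.0 + xi) * x

def ssvr_sketch(x0, T, sigma=0.1, seed=0):
    """Sign update driven by a variance-reduced estimator (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    d = x0.size
    beta = T ** (-2.0 / 3.0)                     # beta = O(T^{-2/3}), constant assumed 1
    eta = 1.0 / (np.sqrt(d) * T ** (2.0 / 3.0))  # eta = O(d^{-1/2} T^{-2/3})
    B0 = int(np.ceil(T ** (1.0 / 3.0)))          # assumed: initial batch size, O(T^{1/3})
    B1 = 1                                       # assumed: per-iteration batch size, O(1)

    x = x0.astype(float).copy()
    xi0 = sigma * rng.standard_normal((B0, d))
    v = stoch_grad(x, xi0).mean(axis=0)          # initial estimate from B0 samples

    for _ in range(T):
        x_prev = x
        x = x - eta * np.sign(v)                 # only the sign of the estimator is used
        xi = sigma * rng.standard_normal((B1, d))
        g_new = stoch_grad(x, xi).mean(axis=0)       # same samples at the new point...
        g_old = stoch_grad(x_prev, xi).mean(axis=0)  # ...and at the previous point
        v = g_new + (1.0 - beta) * (v - g_old)   # STORM-style recursive correction
    return x

x0 = np.ones(10)
x_final = ssvr_sketch(x0, T=1000)
```

Evaluating the same samples at both the new and the previous iterate is what lets the correction term cancel most of the noise; with independent samples the estimator would reduce to a plain momentum average.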

Figures (5)

  • Figure 1: Results for CIFAR-10 dataset in the centralized environment.
  • Figure 2: Results for CIFAR-100 dataset in the distributed environment.
  • Figure 3: Results for CIFAR-10 dataset with different learning rates.
  • Figure 4: Results for CIFAR-10 dataset with different $\beta$.
  • Figure 5: Results for CIFAR-10 dataset with different batch sizes.

Theorems & Definitions (15)

  • Theorem 1
  • Theorem 2
  • Definition 1
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • ...and 5 more