Error Feedback under $(L_0,L_1)$-Smoothness: Normalization and Momentum

Sarit Khirirat, Abdurakhmon Sadiev, Artem Riabinin, Eduard Gorbunov, Peter Richtárik

TL;DR

Distributed error feedback algorithms that utilize normalization to achieve the $O(1/\sqrt{K})$ convergence rate for nonconvex problems under generalized smoothness are proposed and shown to outperform their non-normalized counterparts on various tasks.

Abstract

We provide the first proof of convergence for normalized error feedback algorithms across a wide range of machine learning problems. Despite the popularity and efficiency of error feedback algorithms in training deep neural networks, traditional analyses of these algorithms rely on a smoothness assumption that does not capture the properties of the objective functions in these problems. Rather, these problems have recently been shown to satisfy generalized smoothness assumptions, and the theoretical understanding of error feedback algorithms under these assumptions remains largely unexplored. Moreover, to the best of our knowledge, all existing analyses under generalized smoothness either i) focus on single-node settings or ii) make unrealistically strong assumptions for distributed settings, such as requiring data heterogeneity conditions and almost surely bounded stochastic gradient noise variance. In this paper, we propose distributed error feedback algorithms that utilize normalization to achieve the $O(1/\sqrt{K})$ convergence rate for nonconvex problems under generalized smoothness. Our analyses apply to distributed settings without data heterogeneity conditions, and enable stepsize tuning that is independent of problem parameters. Additionally, we provide strong convergence guarantees for normalized error feedback algorithms in stochastic settings. Finally, we show that, due to their larger allowable stepsizes, our new normalized error feedback algorithms outperform their non-normalized counterparts on various tasks, including the minimization of polynomial functions, logistic regression, and ResNet-20 training.
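
For intuition, the kind of normalized error feedback loop the abstract describes can be sketched in a few lines. This is a minimal single-process simulation, not the paper's code: it assumes the standard EF21 estimator update $g_i^{k+1} = g_i^k + \mathcal{C}(\nabla f_i(x^{k+1}) - g_i^k)$ combined with a normalized step $x^{k+1} = x^k - \gamma\, g^k / \|g^k\|$ and $\gamma = \gamma_0/\sqrt{K+1}$. The quartic toy objectives $f_i(x) = \frac{1}{4}\|x - a_i\|^4$, the initialization of the estimators, and all function names are illustrative assumptions.

```python
import numpy as np

def top_k(v, k):
    """Top-k sparsifier: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def normalized_ef21(grads, x0, K, gamma0=1.0, k=1):
    """Simulate n nodes running a normalized EF21-style loop (illustrative).

    grads : list of callables; grads[i](x) returns the gradient of f_i at x.
    Returns the final iterate and the true gradient norm at every iterate.
    """
    x = np.asarray(x0, dtype=float).copy()
    gamma = gamma0 / np.sqrt(K + 1)            # stepsize needs no smoothness constants
    g_loc = [top_k(g(x), k) for g in grads]    # assumed initialization of the estimators
    grad_norms = []
    for _ in range(K):
        # Track the exact gradient norm of f = mean(f_i) for monitoring only.
        grad_norms.append(np.linalg.norm(np.mean([g(x) for g in grads], axis=0)))
        g_avg = np.mean(g_loc, axis=0)         # server averages the local estimators
        norm = np.linalg.norm(g_avg)
        if norm == 0.0:
            break
        x = x - gamma * g_avg / norm           # normalized step: always has length gamma
        # Each node sends only the compressed change of its gradient estimator.
        g_loc = [gi + top_k(g(x) - gi, k) for gi, g in zip(g_loc, grads)]
    return x, grad_norms

# Toy heterogeneous problem: f_i(x) = 0.25 * ||x - a_i||^4 (hypothetical data).
rng = np.random.default_rng(0)
centers = rng.standard_normal((4, 5))
grads = [lambda x, a=a: np.dot(x - a, x - a) * (x - a) for a in centers]
x_final, grad_norms = normalized_ef21(grads, x0=3.0 * np.ones(5), K=2000, k=2)
print(f"gradient norm: start {grad_norms[0]:.3f}, best {min(grad_norms):.3f}")
```

Because every step has length exactly $\gamma$, the iterates cannot blow up even where the local curvature of the quartic is large, which is the practical reason a stepsize independent of problem parameters is usable here.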

Paper Structure

This paper contains 34 sections, 12 theorems, 109 equations, 6 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

Consider Problem (eqn:Problem), where Assumption assum:lowerbound_whole_f (lower bound on $f$), Assumption assum:lowerbound_f_i (lower bound on $f_i$), Assumption assum:LzeroLoneSmooth (generalized smoothness of $f_i$), and Assumption assum:contractive_comp (contractive compressor) hold. Then, the i for $K \geq 0$ and $\gamma_0>0$ satisfy where $V^k := f(x^k)-f^{\inf} + \frac{2\gamma_k}{1-\sqrt{1-

Figures (6)

  • Figure 1: The minimization of polynomial functions using EF21 with $\gamma = \frac{1}{L + L \sqrt{\frac{\beta}{\theta}}}$, and ||EF21|| with $\gamma = \frac{\gamma_0}{\sqrt{K+1}}$, $\gamma_0 = 1$ (blue line) and $\gamma = \frac{1}{2c_1}$ (green line). Here, we ran both algorithms for (1) $L_0 = 4$, $L_1 = 1$, and $K=2,000$ (left), (2) $L_0 = 4$, $L_1 = 4$, and $K=5,000$ (middle), and (3) $L_0 = 4$, $L_1 = 8$, and $K=16,000$ (right).
  • Figure 2: Logistic regression with a nonconvex regularizer using normalized ||EF21|| and EF21. We reported $\left\| \nabla f(x^k) \right\|^2$ with respect to iteration count $k$. We used the constant stepsize $\gamma = \frac{1}{L + \tilde{L} \sqrt{\frac{\beta}{\theta}}}$ for EF21, and $\gamma = \frac{\gamma_0}{\sqrt{K+1}}$, $\gamma_0 = 1$ for ||EF21||. Here, $K=100$ for our generated data (left), and Breast Cancer (middle), while $K=400$ for a1a (right).
  • Figure 3: ResNet20 training on CIFAR-10 by using EF21 and ||EF21|| under the same stepsize $\gamma=5$ and $k=0.1d$ for a top-$k$ sparsifier.
  • Figure 4: Number of iterations required to achieve the desired accuracy, $\left\| \nabla f(x) \right\|^2 < \epsilon$, $\epsilon = 10^{-4}$, using ||EF21|| with $\gamma = \frac{\gamma_0}{\sqrt{K+1}}$ for different values of $L_0$ and $L_1$.
  • Figure 5: ResNet20 training on CIFAR-10 by using EF21 and ||EF21|| under the same stepsize $\gamma=5$ and $k=0.01d$ for a top-$k$ sparsifier.
  • ...and 1 more figure
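
The two recurring ingredients in these captions, the top-$k$ sparsifier and the stepsize $\gamma = \gamma_0/\sqrt{K+1}$, can be checked numerically. The sketch below relies on the standard fact that top-$k$ is a contractive compressor with $\|\mathcal{C}(v) - v\|^2 \leq (1 - k/d)\|v\|^2$; the dimension `d = 1000`, the random test vector, and the helper names are assumptions made for illustration and are not taken from the experiments.

```python
import math
import numpy as np

def top_k(v, k):
    """Top-k sparsifier: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def compression_error_ratio(v, k):
    """Return ||C(v) - v||^2 / ||v||^2 for the top-k sparsifier C."""
    return float(np.sum((top_k(v, k) - v) ** 2) / np.sum(v ** 2))

def normalized_stepsize(K, gamma0=1.0):
    """Stepsize gamma = gamma0 / sqrt(K + 1), as used in the captions."""
    return gamma0 / math.sqrt(K + 1)

d = 1000                                        # hypothetical model dimension
v = np.random.default_rng(0).standard_normal(d)

# k = 0.1 d and k = 0.01 d, the two sparsity levels in the ResNet20 captions.
for frac in (0.1, 0.01):
    k = max(1, round(frac * d))
    ratio = compression_error_ratio(v, k)
    print(f"k = {k}: error ratio {ratio:.3f} <= bound {1 - k / d:.3f}")

# Iteration budgets from the polynomial-minimization caption, with gamma0 = 1.
for K in (2000, 5000, 16000):
    print(f"K = {K}: gamma = {normalized_stepsize(K):.5f}")
```

A smaller $k$ sends fewer coordinates per round but pushes the contraction bound $1 - k/d$ closer to 1, so the gradient estimators track the true gradients more slowly.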

Theorems & Definitions (22)

  • Theorem 1: Convergence of ||EF21||
  • Theorem 2: Convergence of ||EF21-SGDM||
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • ...and 12 more