Sign-SGD is the Golden Gate between Multi-Node to Single-Node Learning: Significant Boost via Parameter-Free Optimization

Daniil Medyakov, Sergey Stanko, Gleb Molodtsov, Philip Zmushko, Grigoriy Evseev, Egor Petrov, Aleksandr Beznosikov

TL;DR

The paper tackles the high computational and memory demands of training large models by developing parameter-free Sign-SGD methods that automatically adapt stepsizes for both single-node and distributed learning. It introduces two main families: SOS Sign-SGD, which uses a bisection-based search to approximate an optimal fixed stepsize, and ALIAS Sign-SGD, which adapts the stepsize per iteration using local smoothness estimates and can incorporate momentum (Adam-style) variants. The authors provide rigorous convergence analyses under convex and smooth assumptions for exact, stochastic, and distributed gradients, and demonstrate practical efficacy on large-scale tasks such as LLaMA pre-training on C4 and Swin Transformer fine-tuning on Tiny ImageNet, often outperforming tuned baselines while eliminating manual learning-rate tuning. Collectively, the work advances memory-efficient and communication-efficient optimization for large-scale models, offering parameter-free, robust methods suitable for real-world deep learning workflows with significant practical impact for NLP and vision applications.
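For orientation, the base update that both families build on can be sketched as follows. This is plain Sign-SGD with a hand-picked fixed stepsize on a toy quadratic, not the paper's SOS or ALIAS procedures; the names `sign_sgd`, `quad_grad`, and the chosen stepsize are illustrative assumptions.

```python
def sign(v):
    # Returns -1, 0, or 1 depending on the sign of v.
    return (v > 0) - (v < 0)


def sign_sgd(grad, x0, stepsize, num_steps):
    # Plain Sign-SGD: move each coordinate by a fixed stepsize
    # in the direction opposite to the sign of its gradient entry.
    x = list(x0)
    for _ in range(num_steps):
        g = grad(x)
        x = [xi - stepsize * sign(gi) for xi, gi in zip(x, g)]
    return x


def quad_grad(x):
    # Gradient of the toy objective f(x) = 0.5 * sum(x_i ** 2)
    # (an assumption for illustration, not a problem from the paper).
    return list(x)


# The stepsize here is chosen by hand, which is exactly the tuning
# burden that parameter-free variants aim to remove.
x_final = sign_sgd(quad_grad, [1.0, -2.0], stepsize=0.01, num_steps=300)
```

Because every coordinate moves by exactly the stepsize at each iteration, a fixed stepsize leaves the iterate oscillating in a band around the optimum whose width is set by that stepsize, so the result is sensitive to how the stepsize is chosen.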

Abstract

Quite recently, large language models have made a significant breakthrough across various disciplines. However, training them is an extremely resource-intensive task, even for major players with vast computing resources. One of the methods gaining popularity in light of these challenges is Sign-SGD. This method can be applied both as a memory-efficient approach in single-node training and as a gradient compression technique in distributed learning. Nevertheless, from a theoretical standpoint, the effective stepsize cannot be determined automatically: it depends on parameters of the dataset to which we have no access in real-world learning. To address this issue, we design several variants of single-node deterministic Sign-SGD. We extend our approaches to practical scenarios: stochastic single-node and multi-node learning, and methods with incorporated momentum. We conduct extensive experiments on real machine learning problems that emphasize the practical applicability of our ideas.
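To illustrate why sign-based updates act as gradient compression in the multi-node setting, here is a minimal sketch in which each worker transmits only the signs of its gradient and the server combines them by coordinate-wise majority vote. Majority voting is one common aggregation rule for sign-based methods and is assumed here for illustration; the abstract does not say which rule the paper uses. The function name `majority_vote` and the sample gradients are hypothetical.

```python
def sign(v):
    # Returns -1, 0, or 1 depending on the sign of v.
    return (v > 0) - (v < 0)


def majority_vote(worker_grads):
    # Each worker sends one sign per coordinate instead of a full float,
    # then the server takes the sign of the coordinate-wise sum of signs.
    signs = [[sign(g) for g in grad] for grad in worker_grads]
    return [sign(sum(col)) for col in zip(*signs)]


# Made-up local gradients from three workers (illustrative values only).
worker_grads = [
    [0.5, -1.0, 2.0],
    [0.1, 3.0, -0.2],
    [-0.7, -0.4, 1.5],
]
direction = majority_vote(worker_grads)
```

The aggregated `direction` again contains only values in {-1, 0, 1}, so the communication in both directions stays at roughly one sign per coordinate.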

Paper Structure

This paper contains 36 sections, 24 theorems, 171 equations, 5 figures, 5 tables, 8 algorithms.

Key Result

theorem 1

Suppose Assumptions as:smoothness, as:convexity, as:func_optimum, as:go_exact_gradient hold. Then for Algorithm alg:sign-sgd_bisection after obtaining the stepsize $\gamma_0$ the following estimate is valid: Moreover, taking into account the complexity of Algorithm alg:bisection_procedure in relation to the initial stepsize bound $\gamma_s$, to reach $\varepsilon$-accuracy, where $\varepsilon = \

Figures (5)

  • Figure 1: Sign-SGD methods on logistic regression.
  • Figure 2: Comparison of Sign-SGD methods on problem exp:nllsq.
  • Figure 3: Comparison of Sign-SGD methods on LLaMA pre-training. The left column shows results for methods without weight decay, the central column for methods with weight decay, and the right column for methods with momentum parameter $\beta$.
  • Figure 4: Comparison of the stepsize of the ALIAS Adam version with constant $\gamma^t$ against an effective cosine stepsize scheduler.
  • Figure 5: Sign-SGD methods with added momentum parameter $(\beta)$, AdamW (wd) and Prodigy on Swin fine-tuning. The left plot shows the full training process; the right plot shows accuracy over the last 20 epochs.

Theorems & Definitions (56)

  • theorem 1
  • theorem 2
  • remark 1
  • theorem 3
  • remark 2
  • lemma 1: Quadratic inequality
  • proof
  • lemma 2: Bisection entry
  • proof
  • lemma 3: Bisection invariants
  • ...and 46 more