Sign-SGD is the Golden Gate between Multi-Node to Single-Node Learning: Significant Boost via Parameter-Free Optimization
Daniil Medyakov, Sergey Stanko, Gleb Molodtsov, Philip Zmushko, Grigoriy Evseev, Egor Petrov, Aleksandr Beznosikov
TL;DR
The paper tackles the high computational and memory demands of training large models by developing parameter-free Sign-SGD methods that set the stepsize automatically in both single-node and distributed learning. It introduces two main families: SOS Sign-SGD, which uses a bisection-based search to approximate an optimal fixed stepsize, and ALIAS Sign-SGD, which adapts the stepsize at every iteration using local smoothness estimates and also admits momentum (Adam-style) variants. The authors provide convergence analyses under convexity and smoothness assumptions for exact, stochastic, and distributed gradients. They evaluate the methods on large-scale tasks such as LLaMA pre-training on C4 and Swin Transformer fine-tuning on Tiny ImageNet, where the methods often outperform tuned baselines without any manual learning-rate tuning. Overall, the work offers memory-efficient and communication-efficient parameter-free optimizers for large-scale NLP and vision models.
Abstract
Large language models have recently achieved significant breakthroughs across various disciplines. However, training them is extremely resource-intensive, even for major players with vast computing resources. One method gaining popularity in light of these challenges is Sign-SGD. It can be applied both as a memory-efficient approach in single-node training and as a gradient compression technique in distributed learning. Nevertheless, its effective stepsize cannot be determined automatically from theory alone, because the theoretical value depends on parameters of the dataset that are not accessible in real-world learning. To address this issue, we design several variants of single-node deterministic Sign-SGD. We then extend our approaches to practical scenarios: stochastic single-node learning, multi-node learning, and methods with incorporated momentum. We conduct extensive experiments on real machine learning problems that demonstrate the practical applicability of our ideas.
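For readers unfamiliar with the base method, the following is a minimal sketch of plain Sign-SGD with a hand-picked fixed stepsize, together with one common way of aggregating signs across workers (a majority vote). It is an illustration only and not the parameter-free SOS or ALIAS procedures proposed in the paper. The function names `sign_sgd` and `majority_vote`, the toy quadratic objective, and the stepsize value are all assumptions made for this example.

```python
import numpy as np

def sign_sgd(grad, x0, stepsize, num_steps):
    """Plain Sign-SGD: x <- x - stepsize * sign(grad(x)).

    Only the sign of each gradient coordinate is used, so every
    coordinate moves by exactly `stepsize` (or stays put if its
    gradient is zero). The stepsize here is fixed and chosen by hand.
    """
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(num_steps):
        x -= stepsize * np.sign(grad(x))
    return x

def majority_vote(worker_grads):
    """Illustrative distributed aggregation (an assumption, not taken
    from the paper): each worker sends one sign per coordinate, and
    the result is the sign of the summed votes."""
    signs = np.sign(np.stack(worker_grads))
    return np.sign(signs.sum(axis=0))

# Toy objective f(x) = 0.5 * ||x - target||^2 with gradient x - target.
target = np.array([1.0, -2.0, 0.5])
grad = lambda x: x - target

x_final = sign_sgd(grad, np.zeros(3), stepsize=0.01, num_steps=500)

# Three hypothetical workers voting on a two-coordinate update direction.
votes = majority_vote([
    np.array([1.0, -2.0]),
    np.array([0.5, 3.0]),
    np.array([-0.1, -1.0]),
])
```

With a fixed stepsize the iterates can only settle to within roughly one stepsize of the minimizer, and a poorly chosen value either stalls progress or leaves a large residual error. This sensitivity is the motivation the abstract gives for choosing the stepsize automatically.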
