Distributed Low-Communication Training with Decoupled Momentum Optimization

Sasho Nedelkoski, Alexander Acker, Odej Kao, Soeren Becker, Dominik Scheinert

TL;DR

This work targets the inter-node communication bottleneck in distributed training of large models by combining federated-learning-style local updates with momentum compression. It decomposes Nesterov momentum into high- and low-frequency components via the discrete cosine transform (DCT) and synchronizes only the top-$k$ high-frequency momentum terms every $H$ steps, while the low-frequency components continue to accumulate locally on each replica. Empirically, the approach reduces communication by up to $16\times$ compared with DiLoCo and by up to $3000\times$ compared with standard data-parallel training, with modest perplexity and accuracy trade-offs on GPT-NeoX (C4) and ResNet (ImageNet-1k) across 2- and 4-node configurations. The results indicate that training large-scale models on low-bandwidth distributed resources is practical, and they motivate further theoretical and empirical study of momentum-compression-based distributed optimization.

Abstract

The training of large models demands substantial computational resources, typically available only in data centers with high-bandwidth interconnects. However, reducing the reliance on high-bandwidth interconnects between nodes enables the use of distributed compute resources as an alternative to centralized data center training. Building on recent advances in distributed model training, we propose an approach that further reduces communication by combining infrequent synchronizations across distributed model replicas with gradient momentum compression. In particular, we treat the optimizer momentum as a signal and decompose the Nesterov momentum into high- and low-frequency components via the discrete cosine transform (DCT). Only the high-frequency components are synchronized across model replicas every $H$ steps. Empirically, our method achieves up to a $16\times$ reduction in communication compared to the baseline DiLoCo, and it generalizes across architectures, including transformer-based language models and convolutional neural networks for images. Overall, this work advances the feasibility of training large models on distributed nodes with low-bandwidth interconnects.
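
The decomposition described above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`dct`, `idct`, `split_momentum`) are hypothetical, the momentum is treated as one flat vector rather than per-layer chunks, and "top-$k$" is assumed to mean the $k$ DCT coefficients with the largest magnitude, which the abstract does not specify. The sketch uses a plain orthonormal DCT-II written with the standard library so that it runs without extra dependencies.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a list of floats (O(n^2), for illustration only)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        out.append(scale * s)
    return out


def idct(c):
    """Inverse of dct() above (the transpose of the orthonormal DCT-II matrix)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def split_momentum(momentum, k):
    """Split a momentum vector into a sparse part to share and a local residual.

    Assumption: the k coefficients with the largest magnitude are selected.
    Returns (payload, shared, residual), where payload is a list of
    (index, coefficient) pairs, shared is the signal rebuilt from the payload,
    and residual = momentum - shared stays on the local replica.
    """
    coeffs = dct(momentum)
    top = sorted(range(len(coeffs)), key=lambda j: abs(coeffs[j]), reverse=True)[:k]
    payload = [(j, coeffs[j]) for j in sorted(top)]

    # Rebuild the shared component from the sparse coefficients only.
    sparse = [0.0] * len(coeffs)
    for j, value in payload:
        sparse[j] = value
    shared = idct(sparse)

    residual = [m - s for m, s in zip(momentum, shared)]
    return payload, shared, residual


if __name__ == "__main__":
    # Hypothetical momentum values, chosen only to exercise the functions.
    momentum = [0.5, -1.0, 0.25, 2.0, -0.75, 0.1, 1.5, -0.3]
    payload, shared, residual = split_momentum(momentum, k=2)
    print("coefficients sent:", len(payload), "of", len(momentum))
```

In this sketch a replica would transmit only the $k$ (index, coefficient) pairs in `payload` rather than all $n$ momentum entries, and would do so only every $H$ steps; the residual is never sent. How the exchanged components are aggregated and applied to the parameters is outside the scope of the sketch.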

Paper Structure

This paper contains 12 sections, 2 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: (a) Language modeling training loss; (b) Image classification training loss