DeToNATION: Decoupled Torch Network-Aware Training on Interlinked Online Nodes

Mogens Henrik From, Jacob Nielsen, Lukas Galke Poech, Peter Schneider-Kamp

TL;DR

The paper addresses the high communication cost of training large models on distributed hardware by extending Decoupled Momentum (DeMo) to a setting where the model is fully sharded within each node. It introduces FlexDeMo, a hybrid sharded data-parallel framework that shards model and optimizer state within a node while replicating only the fast-moving momentum components between nodes, using replication schemes such as Random, Striding, and DiLoCo. FlexDeMo reaches validation performance comparable to full-gradient synchronization with AdamW while running faster, especially under limited inter-node bandwidth, and it enables training models that exceed single-accelerator memory. The work evaluates the approach across domains (T5, ViT, OLMo2), analyzes bandwidth trade-offs, and provides practical guidance for scaling large models with reduced network demands. Overall, FlexDeMo reduces inter-node communication bottlenecks and broadens the feasible space for large-scale distributed training.

Abstract

Training large neural network models requires extensive computational resources, often distributed across several nodes and accelerators. Recent findings suggest that it may be sufficient to exchange only the fast-moving components of the gradients, while accumulating momentum locally (Decoupled Momentum, or DeMo). However, DeMo assumes that models fit on a single accelerator. We relax this assumption and introduce FlexDeMo, whereby nodes fully shard model parameters locally between different accelerators, while inter-node communication is reduced by synchronizing only fast-moving components instead of the full gradients, resulting in a hybrid sharded data parallel training strategy. We further introduce a framework, denoted as DeToNATION, that generalizes DeMo, FlexDeMo, and other popular distributed training schemes such as DiLoCo, introducing new variations of replication schemes and challenging choices made in DeMo. Our results across language and vision domains show that FlexDeMo attains a validation loss similar to that of hybrid sharded data parallel training employing AdamW and full gradient synchronization, while being substantially faster. FlexDeMo is thus a promising distributed training scheme for the largest machine learning models.
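
To make the core idea concrete, the following is a toy, single-process sketch of one decoupled-momentum step: each node accumulates momentum locally, and only a few components are averaged across nodes. It is not the authors' implementation. The function names (`extract_top_k`, `decoupled_step`), the use of top-k magnitude as a stand-in for "fast-moving components", the subtraction of transmitted components from local momentum, and the plain SGD-style update are all assumptions made for illustration; intra-node sharding is omitted entirely.

```python
from typing import List, Tuple


def extract_top_k(momentum: List[float], k: int) -> Tuple[List[int], List[float]]:
    """Select the k largest-magnitude entries of a momentum vector.

    Assumption: top-k magnitude stands in for the 'fast-moving components';
    the actual extraction used by DeMo/FlexDeMo may differ.
    """
    order = sorted(range(len(momentum)), key=lambda i: abs(momentum[i]), reverse=True)
    idx = sorted(order[:k])
    return idx, [momentum[i] for i in idx]


def decoupled_step(
    params: List[float],
    grads: List[List[float]],    # one gradient vector per simulated node
    momenta: List[List[float]],  # one local momentum vector per node (mutated)
    lr: float = 0.1,
    beta: float = 0.9,
    k: int = 1,
) -> Tuple[List[float], List[float]]:
    """Simulate one step in which nodes exchange only k components each."""
    n_nodes = len(grads)
    dim = len(params)
    update = [0.0] * dim
    for node in range(n_nodes):
        m = momenta[node]
        # Accumulate the local gradient into the node's own momentum.
        for i in range(dim):
            m[i] = beta * m[i] + grads[node][i]
        # Only the selected components leave the node.
        idx, vals = extract_top_k(m, k)
        for i, v in zip(idx, vals):
            m[i] -= v                 # assumption: sent part is removed locally
            update[i] += v / n_nodes  # simulated average over nodes
    # Plain SGD-style update applied identically on every replica.
    new_params = [p - lr * u for p, u in zip(params, update)]
    return new_params, update


if __name__ == "__main__":
    params = [0.0, 0.0, 0.0]
    grads = [[1.0, 0.1, 0.0], [0.0, 0.2, 2.0]]
    momenta = [[0.0] * 3 for _ in grads]
    params, update = decoupled_step(params, grads, momenta, k=1)
    print("update:", update)
    print("residual momenta:", momenta)
```

In this sketch each node sends k index-value pairs per step instead of a full gradient vector, and the components that were not sent stay in the local momentum, where they keep accumulating until they are selected.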

Paper Structure

This paper contains 45 sections and 19 figures.

Figures (19)

  • Figure 1: T5-Large validation loss on the Opus Books En-Fr subset. Random and DeMo replication demonstrate strong performance.
  • Figure 2: ViT-B validation loss on CIFAR-100. DeMo and DiLoCo perform best and similarly to each other.
  • Figure 3: OLMo2 1B train loss (zoomed) over 10K training steps on Dolma v1.6 using different replicators and compression rates. All experiments, except the Hybrid-FSDP baseline with AdamW on two nodes, use DeMo-SGD.
  • Figure 4: OLMo2 1B train loss vs. wall-clock time over 10K training steps on Dolma v1.6, comparing different replicators and compression rates. All experiments use DeMo-SGD except for the Hybrid-FSDP baseline with AdamW. FlexDeMo shows clear improvements in convergence speed.
  • Figure 5: Average time per optimizer step for ViT-B on two nodes across different replicators, using DeMo-SGD and Decoupled AdamW as base optimizers.
  • ...and 14 more figures