Asynchronous Local-SGD Training for Language Modeling

Bo Liu; Rachita Chhaparia; Arthur Douillard; Satyen Kale; Andrei A. Rusu; Jiajun Shen; Arthur Szlam; Marc'Aurelio Ranzato

TL;DR

The paper proposes a method that combines a delayed Nesterov momentum update with per-worker local training steps adjusted to each worker's computation speed. The method matches synchronous Local-SGD in perplexity per update step and significantly surpasses it in wall-clock time.

Abstract

Local stochastic gradient descent (Local-SGD), also referred to as federated averaging, is an approach to distributed optimization where each device performs more than one SGD update per communication. This work presents an empirical study of asynchronous Local-SGD for training language models; that is, each worker updates the global parameters as soon as it has finished its SGD steps. We conduct a comprehensive investigation by examining how worker hardware heterogeneity, model size, number of workers, and optimizer could impact the learning performance. We find that with naive implementations, asynchronous Local-SGD takes more iterations to converge than its synchronous counterpart despite updating the (global) model parameters more frequently. We identify momentum acceleration on the global parameters when worker gradients are stale as a key challenge. We propose a novel method that utilizes a delayed Nesterov momentum update and adjusts the workers' local training steps based on their computation speed. This approach, evaluated with models up to 150M parameters on the C4 dataset, matches the performance of synchronous Local-SGD in terms of perplexity per update step, and significantly surpasses it in terms of wall-clock time.
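To make the abstract's definition of Local-SGD concrete, here is a minimal sketch of the synchronous variant on a toy one-dimensional problem: each worker takes several SGD steps from the current global model, and the server averages the results once per round. The function names, the quadratic objective, and the speed-proportional step rule in `speed_proportional_steps` are illustrative assumptions, not the paper's implementation or its exact algorithms.

```python
import random


def local_sgd(targets, rounds=20, local_steps=5, inner_lr=0.1, seed=0):
    """Synchronous Local-SGD (federated averaging) on a toy 1-D problem.

    Worker i minimizes 0.5 * (w - targets[i]) ** 2 using noisy gradients.
    The objective and all names are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)
    w_global = 0.0
    for _ in range(rounds):
        local_models = []
        for target in targets:
            w = w_global  # every worker starts from the current global model
            for _ in range(local_steps):  # several SGD updates per communication
                grad = (w - target) + rng.gauss(0.0, 0.01)
                w -= inner_lr * grad
            local_models.append(w)
        # One communication per round: average the workers' parameters.
        w_global = sum(local_models) / len(local_models)
    return w_global


def speed_proportional_steps(steps_per_sec, max_local_steps):
    """Assumed rule: scale each worker's local steps by its relative speed."""
    fastest = max(steps_per_sec)
    return [max(1, int(max_local_steps * v / fastest)) for v in steps_per_sec]


if __name__ == "__main__":
    print(local_sgd([1.0, 3.0]))
    print(speed_proportional_steps([1.0, 2.0, 4.0], max_local_steps=8))
```

With two workers whose optima are 1.0 and 3.0, the averaged model settles near their mean. In the asynchronous setting studied here, the server would instead apply each worker's update as soon as it arrives, which is where stale updates and the momentum issue described above come in.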

Paper Structure (38 sections, 6 equations, 12 figures, 6 tables, 5 algorithms)

Introduction
1. Framework (Section \ref{sec:async_framework}).
2. Optimization Challenge (Section \ref{sec:challenge}).
3. Proposed Solutions (Section \ref{sec:method}).
Background
Async. Local-SGD Framework
Data Shard Sampling
Learning Rate Scheduling
Grace Period for Model Synchronization
Asynchronous Task Scheduling
Optimization Challenge
Effect of InnerOpt + OuterOpt
Momentum in the OuterOpt
Is Staleness the Cause?
Baselines
...and 23 more sections

Figures (12)

Figure 1: Illustration of async. vs. sync. training with 2 workers (in blue and red). Sync. training suffers from the straggler effect, while async. training reduces the idle time of the fast worker.
Figure 2: Comparative evaluation of language models using sync. and async. Local-SGD methods with 4 heterogeneous workers on a 20M parameter model. The state-of-the-art sync. Local-SGD method, DiLoCo (Douillard et al., 2023), employs AdamW and Nesterov momentum as the worker-side and server-side optimizers, respectively. This optimizer combination remains the strongest for async. Local-SGD training (see Figure \ref{fig:all_optimizer}), yet underperforms DiLoCo significantly. By integrating Delayed Nesterov (DN) (Algorithm \ref{alg:delayed_nesterov}) for outer optimization and Dynamic Local Updates (DyLU) (Section \ref{sec:dylu}), we significantly narrow the performance gap in terms of perplexity versus updates between sync. and async. training in language modeling. Moreover, the proposed method significantly surpasses DiLoCo in terms of perplexity versus wall-clock time.
Figure 3: We consecutively synchronize the update from B after we synchronize A because B finishes its training after A but before the end of the grace period. A and B will therefore use the same server model to start the new training jobs, while C will start its own grace period.
Figure 4: Steps per second for each device.
Figure 5: Performance of using different combinations of inner and outer optimizers for asynchronous Local-SGD training on a 20M language model with 4 workers.
...and 7 more figures
