ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training

Adel Nabli; Louis Fournier; Pierre Erbacher; Louis Serrano; Eugene Belilovsky; Edouard Oyallon

TL;DR

ACCO introduces a memory-efficient, communication–computation overlapped optimization for distributed LLM training that preserves AdamW-like convergence. By splitting minibatches into two stages and compensating for delayed updates, ACCO hides communication latency without requiring warmup or extra hyperparameters. Theoretical guarantees for SGD and extensive experiments on TinyStories, OpenWebText pretraining, and instruction fine-tuning show that ACCO matches standard training dynamics while delivering significant speedups, especially on heterogeneous hardware. This approach enables scalable, efficient LLM training under realistic memory and interconnect constraints, with practical impact for large-scale model development.

Abstract

Training LLMs relies on distributed implementations using multiple GPUs to compute gradients in parallel with sharded optimizers. However, synchronizing gradients in data parallel setups introduces communication overhead that grows with the number of workers, limiting parallelization efficiency. Local optimization algorithms reduce communications but incur high memory costs as they prevent optimizer state sharding, hindering scalability. To address this, we propose ACcumulate while COmmunicate (ACCO), a memory-efficient optimization algorithm for distributed LLM training. By synchronizing delayed gradients while computing new ones, ACCO reduces GPU idle time and supports heterogeneous hardware. To mitigate the convergence issues caused by delayed updates, we introduce a novel technique ensuring training dynamics align with standard distributed optimization. Compared to ZeRO-1, our approach is significantly faster and scales effectively across heterogeneous hardware.
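The two-stage idea (half of the mini-batch estimates "next step" parameters, then the full mini-batch performs the actual update) can be illustrated with a toy, single-process sketch. This is only one simplified reading of the description above, not the authors' algorithm: it uses plain gradient descent on a scalar least-squares problem instead of a sharded AdamW, runs sequentially with no real communication, and the names `grad`, `two_stage_step`, and `fit` are hypothetical.

```python
def grad(w, xs, ys):
    """Gradient of 0.5 * mean((w*x - y)**2) with respect to the scalar w."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def two_stage_step(w, xs, ys, lr):
    """One toy two-stage update (an assumption-laden sketch, not ACCO itself).

    Stage 1: the first half of the mini-batch yields a look-ahead estimate.
    Stage 2: the second half is evaluated at that estimate, and both halves
             together form the full mini-batch update applied to w.
    In a distributed run, averaging the stage-1 gradient across workers would
    overlap with computing the stage-2 gradient; here everything is sequential.
    """
    half = len(xs) // 2
    g_first = grad(w, xs[:half], ys[:half])
    w_est = w - lr * g_first                 # estimated "next step" parameter
    g_second = grad(w_est, xs[half:], ys[half:])
    g_full = 0.5 * (g_first + g_second)      # both halves have equal size
    return w - lr * g_full, w_est


def fit(xs, ys, lr=0.1, steps=200, w0=0.0):
    """Repeat the toy two-stage step and return the final parameter."""
    w = w0
    for _ in range(steps):
        w, _ = two_stage_step(w, xs, ys, lr)
    return w


if __name__ == "__main__":
    xs = [0.5, 1.0, 1.5, 2.0]
    ys = [3.0 * x for x in xs]               # noiseless data, true slope 3
    print(fit(xs, ys))
```

On this noiseless problem the iterates approach the true slope, which mirrors, at toy scale only, the claim that compensating the delay keeps the dynamics close to standard training.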

Paper Structure (30 sections, 4 theorems, 32 equations, 13 figures, 5 tables, 1 algorithm)

Introduction
Related work
Overlapping communications and computations.
Memory-efficient distributed training of LLMs.
Method
Background and Notations
Distributed Data Parallelism (DDP).
Delayed Parameter Update (DPU).
Weight Prediction (WP).
ACCO: a structured approach to Communication-Computation overlap.
ACCO.
Theoretical analysis of ACCO.
Experiments
Empirical motivation for ACCO
Experimental setup
...and 15 more sections

Key Result

Proposition 3.1

Let $f : \mathbb{R}^d \to \mathbb{R}$ be $L$-smooth and $\theta^* \in \arg\min f$. For $\eta \leq \frac{1}{2L}$, define Then, for any $T \geq 1$, and initializations $\theta_0,\tilde{\theta}_0\in \mathbb{R}^d$, we have

Figures (13)

Figure 1: ACCO with a slow and a fast worker running in parallel, showing no idle time on both and hiding communications. The delayed update is compensated by splitting the mini-batch in two, leading to two steps in our timeline. The first uses half of the mini-batch to estimate "next step" parameters, and the second uses the full mini-batch to update them.
Figure 2: ACCO's two-stage mechanism (1)-(2) to compensate the delayed updates via overlapping.
Figure 3: Memory requirements of ACCO vs DDP and ZeRO-1; see the related-work table (tab:relatedwork) for quantitative details.
Figure 4: Time (per worker) spent computing and averaging gradients of a Llama-2 7B model for different numbers of GPUs.
Figure 5: Impact of the delayed update and warmup steps.
...and 8 more figures

Theorems & Definitions (6)

Proposition 3.1: GD
Proposition 3.2: SGD
Proposition A.1: Gradient Descent Case
proof
Proposition A.2: Stochastic Gradient Descent Case with Bounded Variance
proof
