ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training

Adel Nabli, Louis Fournier, Pierre Erbacher, Louis Serrano, Eugene Belilovsky, Edouard Oyallon

TL;DR

ACCO introduces a memory-efficient, communication–computation overlapped optimization for distributed LLM training that preserves AdamW-like convergence. By splitting minibatches into two stages and compensating for delayed updates, ACCO hides communication latency without requiring warmup or extra hyperparameters. Theoretical guarantees for SGD and extensive experiments on TinyStories, OpenWebText pretraining, and instruction fine-tuning show that ACCO matches standard training dynamics while delivering significant speedups, especially on heterogeneous hardware. This approach enables scalable, efficient LLM training under realistic memory and interconnect constraints, with practical impact for large-scale model development.

Abstract

Training LLMs relies on distributed implementations using multiple GPUs to compute gradients in parallel with sharded optimizers. However, synchronizing gradients in data parallel setups introduces communication overhead that grows with the number of workers, limiting parallelization efficiency. Local optimization algorithms reduce communications but incur high memory costs as they prevent optimizer state sharding, hindering scalability. To address this, we propose ACcumulate while COmmunicate (ACCO), a memory-efficient optimization algorithm for distributed LLM training. By synchronizing delayed gradients while computing new ones, ACCO reduces GPU idle time and supports heterogeneous hardware. To mitigate the convergence issues caused by delayed updates, we introduce a novel technique ensuring training dynamics align with standard distributed optimization. Compared to ZeRO-1, our approach is significantly faster and scales effectively across heterogeneous hardware.

Paper Structure

This paper contains 30 sections, 4 theorems, 32 equations, 13 figures, 5 tables, 1 algorithm.

Key Result

Proposition 3.1

Let $f : \mathbb{R}^d \to \mathbb{R}$ be $L$-smooth and $\theta^* \in \arg\min f$. For $\eta \leq \frac{1}{2L}$, define […]. Then, for any $T \geq 1$ and initializations $\theta_0, \tilde{\theta}_0 \in \mathbb{R}^d$, we have […]

Figures (13)

  • Figure 1: ACCO with a slow and a fast worker running in parallel, with no idle time on either worker and communications hidden behind computation. The delayed update is compensated by splitting the mini-batch in two, leading to two steps in our timeline. The first uses half of the mini-batch to estimate "next step" parameters, and the second uses the full mini-batch to update them.
  • Figure 2: ACCO's two-stage mechanism (1)-(2) to compensate the delayed updates via overlapping.
  • Figure 3: Memory requirements of ACCO vs DDP and ZeRO-1, see Tab. \ref{tab:relatedwork} for quantitative details.
  • Figure 4: Time (per worker) spent computing and averaging gradients of a Llama-2 7B model for different numbers of GPUs.
  • Figure 5: Impact of the delayed update and warmup steps.
  • ...and 8 more figures
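
The two-stage idea in the captions of Figures 1 and 2 can be illustrated with a small sequential toy. The sketch below is one possible reading of those captions, not the paper's Algorithm 1: it assumes plain gradient descent instead of AdamW, a single worker, a 1-D least-squares objective, and no real communication overlap. The names `grad`, `two_stage_sgd`, and `DATA` are hypothetical. Each round first forms a "next step" estimate from one half of the mini-batch, then commits an update that averages the gradients of both halves.

```python
# Toy illustration only (assumed setup, not the authors' implementation).
# Objective: f(theta) = mean over (x, y) of 0.5 * (theta * x - y)^2.

DATA = [(1.0, 3.0), (2.0, 6.0), (1.0, 3.0), (2.0, 6.0)]  # y = 3 * x


def grad(theta, batch):
    """Mean gradient of 0.5 * (theta * x - y)^2 over a batch of (x, y) pairs."""
    return sum((theta * x - y) * x for x, y in batch) / len(batch)


def two_stage_sgd(data, theta0, lr, rounds):
    """Two-stage update: half-batch estimate, then full-batch commit."""
    half = len(data) // 2
    first_half, second_half = data[:half], data[half:]

    theta = theta0      # committed parameters
    theta_est = theta0  # "next step" estimate of the parameters
    # Half-batch gradient evaluated at the current estimate.
    g_est = grad(theta_est, first_half)

    for _ in range(rounds):
        # Stage 1: estimate the next-step parameters from the half-batch
        # gradient. In an overlapped run, the other half-batch gradient
        # would be computed at the same time; here it is done in sequence.
        g_cur = grad(theta, second_half)
        theta_est = theta - lr * g_est

        # Stage 2: compute the next half-batch gradient at the estimate,
        # and commit the update from `theta` using both halves.
        g_next = grad(theta_est, first_half)
        theta = theta - lr * 0.5 * (g_est + g_cur)
        g_est = g_next

    return theta


if __name__ == "__main__":
    # The result approaches the true slope 3.0 on this toy data.
    print(two_stage_sgd(DATA, theta0=0.0, lr=0.1, rounds=200))
```

On this toy data the objective is $L$-smooth with $L = 2.5$, so the step size `lr=0.1` lies within the range $\eta \leq \frac{1}{2L}$ stated in Proposition 3.1. The committed update always starts from `theta`, never from the estimate, which is how this sketch keeps the estimate from accumulating drift.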

Theorems & Definitions (6)

  • Proposition 3.1: GD
  • Proposition 3.2: SGD
  • Proposition A.1: Gradient Descent Case
  • proof
  • Proposition A.2: Stochastic Gradient Descent Case with Bounded Variance
  • proof