ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training
Adel Nabli, Louis Fournier, Pierre Erbacher, Louis Serrano, Eugene Belilovsky, Edouard Oyallon
TL;DR
ACCO is a memory-efficient optimization algorithm for distributed LLM training that overlaps communication with computation while preserving AdamW-like convergence. By splitting each minibatch into two stages and compensating for delayed updates, ACCO hides communication latency without requiring warmup or extra hyperparameters. Theoretical guarantees for SGD, together with experiments on TinyStories, OpenWebText pretraining, and instruction fine-tuning, show that ACCO matches standard training dynamics while delivering significant speedups, especially on heterogeneous hardware. The approach targets scalable LLM training under realistic memory and interconnect constraints.
Abstract
Training LLMs relies on distributed implementations that use multiple GPUs to compute gradients in parallel with sharded optimizers. However, synchronizing gradients in data-parallel setups introduces communication overhead that grows with the number of workers, limiting parallelization efficiency. Local optimization algorithms reduce communication but incur high memory costs because they prevent optimizer state sharding, which hinders scalability. To address this, we propose ACcumulate while COmmunicate (ACCO), a memory-efficient optimization algorithm for distributed LLM training. By synchronizing delayed gradients while computing new ones, ACCO reduces GPU idle time and supports heterogeneous hardware. To mitigate the convergence issues caused by delayed updates, we introduce a novel technique that keeps training dynamics aligned with standard distributed optimization. Compared to ZeRO-1, our approach is significantly faster and scales effectively across heterogeneous hardware.
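The abstract does not spell out the update rule, so the following is only a toy, single-process sketch of one plausible reading of the two-stage idea: gradients from the first half of a minibatch produce a speculative parameter estimate, gradients from the second half are evaluated at that estimate, and the real update restarts from the original parameters using both halves. Everything here is an assumption made for illustration: the quadratic loss, plain SGD in place of AdamW, and the hypothetical names `mean_grad`, `two_stage_step`, and `run`. Communication, sharding, and the actual overlap are not modelled.

```python
def mean_grad(theta, samples):
    """Average gradient of the toy loss 0.5 * (theta - s)**2 over the samples."""
    return sum(theta - s for s in samples) / len(samples)


def two_stage_step(theta, first_half, second_half, lr):
    """One illustrative round (assumed scheme, not the authors' implementation)."""
    # Stage 1: gradient on the first half at the current parameters.
    g_first = mean_grad(theta, first_half)
    # Speculative estimate of the next parameters from stage-1 gradients only.
    # In a real system this is where stage-1 gradients would be in flight
    # while the next computation proceeds.
    theta_est = theta - lr * g_first
    # Stage 2: gradient on the second half, evaluated at the estimate.
    g_second = mean_grad(theta_est, second_half)
    # Real update: restart from theta (not theta_est) and average both halves.
    return theta - lr * 0.5 * (g_first + g_second)


def run(first_half, second_half, lr=0.5, steps=50, theta=0.0):
    """Repeat the two-stage round on fixed half-batches (toy setting)."""
    for _ in range(steps):
        theta = two_stage_step(theta, first_half, second_half, lr)
    return theta


# Both halves have mean 2.0, so the minimizer of the toy loss is 2.0.
theta_final = run([1.0, 3.0], [1.0, 3.0])
```

In this toy setting the iterate contracts toward the minimizer at every round even though half of each gradient is evaluated at a speculative point, which is the intuition behind compensating for delayed updates; it says nothing about the speedups reported against ZeRO-1.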
