Cross-region Model Training with Communication-Computation Overlapping and Delay Compensation
Ying Zhu, Yang Xu, Hongli Xu, Yunming Liao, Zhiwei Yao, Liusheng Huang
TL;DR
CoCoDC tackles the inefficiency of cross-region LLM training under WAN latency by integrating Taylor-expansion–based delay compensation with an adaptive fragment synchronization strategy. This combination mitigates the update staleness and model inconsistency caused by communication-computation overlap, while dynamically scheduling fragment synchronization to make better use of bandwidth. Empirical results show CoCoDC outperforms both DiLoCo and Streaming DiLoCo, converging faster and needing up to 21.0% fewer training steps than Streaming DiLoCo to reach comparable perplexity. The work offers a practical framework for scalable, efficient cross-region LLM training over high-latency wide-area networks.
Abstract
Training large language models (LLMs) requires massive computational resources, often necessitating the aggregation of geographically distributed data centers (i.e., cross-region training). However, the high communication latency in wide-area networks severely degrades the efficiency of traditional distributed training. While methods like DiLoCo reduce communication frequency, they suffer from blocking synchronization. Streaming DiLoCo alleviates this issue via communication-computation overlapping but introduces update staleness and model inconsistency due to delayed global updates and partial synchronization. These factors impair convergence, especially when aggressive overlap is needed to mask high latency. We propose CoCoDC, a novel distributed training framework with communication-computation overlapping and delay compensation, to explicitly tackle these challenges. Within the CoCoDC framework, we specifically develop a novel Delay Compensation strategy based on Taylor expansion to effectively mitigate the staleness and an Adaptive Transmission strategy that dynamically schedules model fragment synchronization to optimize bandwidth usage and accelerate convergence. Extensive experiments highlight the superior performance of CoCoDC over both DiLoCo and Streaming DiLoCo regarding final accuracy and training speed. Specifically, CoCoDC reduces the training steps needed to reach a comparable perplexity by up to 21.0% compared to Streaming DiLoCo. Our work provides an effective solution for scalable and efficient cross-region LLM training.
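To make the idea of Taylor-expansion delay compensation concrete, the sketch below shows a generic first-order correction of a stale gradient, in the style of delay-compensated asynchronous SGD. It is not CoCoDC's actual formulation, which the abstract does not spell out. The function name `compensate_stale_gradient`, the scaling factor `lam`, and the use of the element-wise squared gradient as a cheap stand-in for the Hessian diagonal are all assumptions made for illustration.

```python
def compensate_stale_gradient(stale_grad, stale_params, current_params, lam=0.5):
    """Correct a gradient computed at stale_params so it better matches current_params.

    First-order Taylor expansion of the gradient around the stale parameters:
        g(w_now) ~= g(w_old) + H(w_old) * (w_now - w_old)
    Assumption: the Hessian H is approximated element-wise by lam * g * g,
    which avoids computing second derivatives.
    """
    return [
        g + lam * g * g * (w_now - w_old)
        for g, w_old, w_now in zip(stale_grad, stale_params, current_params)
    ]


# Toy check on f(w) = 0.5 * w**2, whose gradient is g(w) = w.
w_old, w_now = [1.0], [1.2]   # parameters drifted while the update was in flight
g_old = [w_old[0]]            # gradient evaluated at the stale parameters
g_corrected = compensate_stale_gradient(g_old, w_old, w_now, lam=1.0)
```

In this toy case the squared gradient at `w_old = 1.0` happens to equal the true curvature, so the corrected value matches the gradient at `w_now`; in general the correction is only an approximation whose quality depends on `lam` and on how far the parameters have drifted.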
