Lagom: Unleashing the Power of Communication and Computation Overlapping for Distributed LLM Training
Guanbin Xu, ZhenGuo Xu, Yuzhe Li, Youhui Bai, Ping Gong, Chaoyi Ruan, Cheng Li
TL;DR
Lagom is a system that co-tunes communication parameters to balance resource usage between computation and communication. It introduces a unified cost model and a priority-based search algorithm that reduces optimization complexity from exponential to linear.
Abstract
Overlapping communication with computation is crucial for distributed large-model training, yet optimizing it, especially when computation becomes the bottleneck, remains challenging. We present Lagom, a system that co-tunes communication parameters to balance resource usage between computation and communication. By introducing a unified cost model and a priority-based search algorithm, Lagom reduces optimization complexity from exponential to linear. Evaluations on high- and low-bandwidth GPU clusters show that Lagom achieves 1.07-1.33x and 1.03-1.27x speedups over NCCL and AutoCCL across diverse models and parallelization strategies.
