Lagom: Unleashing the Power of Communication and Computation Overlapping for Distributed LLM Training

Guanbin Xu; ZhenGuo Xu; Yuzhe Li; Youhui Bai; Ping Gong; Chaoyi Ruan; Cheng Li

TL;DR

Lagom is a system that co-tunes communication parameters to balance resource usage between computation and communication. It introduces a unified cost model and a priority-based search algorithm that reduces optimization complexity from exponential to linear.

Abstract

Overlapping communication with computation is crucial for distributed large-model training, yet optimizing it remains challenging, especially when computation becomes the bottleneck. We present Lagom, a system that co-tunes communication parameters to balance resource usage between computation and communication. By introducing a unified cost model and a priority-based search algorithm, Lagom reduces optimization complexity from exponential to linear. Evaluations on high- and low-bandwidth GPU clusters show that Lagom achieves 1.07-1.33x and 1.03-1.27x speedups over NCCL and AutoCCL across diverse models and parallelizations.
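To make the "exponential to linear" claim concrete, the sketch below counts cost evaluations for two generic search strategies. It is not Lagom's algorithm or cost model, which this summary does not describe. The names `joint_search`, `priority_search`, and `toy_cost`, the candidate values, and the fixed priority order are all hypothetical. With k communications and m candidate settings each, an exhaustive joint search evaluates m^k configurations, while tuning one communication at a time in priority order evaluates k*m.

```python
from itertools import product


def joint_search(candidates_per_comm, cost):
    """Evaluate every joint configuration: m**k cost evaluations."""
    best, best_cost, evals = None, float("inf"), 0
    for config in product(*candidates_per_comm):
        evals += 1
        c = cost(config)
        if c < best_cost:
            best, best_cost = config, c
    return best, evals


def priority_search(candidates_per_comm, cost, priority_order):
    """Tune one communication at a time in a given order: k*m evaluations."""
    # Start from the first candidate of every communication (an assumption).
    config = [cands[0] for cands in candidates_per_comm]
    evals = 0
    for i in priority_order:
        best_val, best_cost = config[i], float("inf")
        for val in candidates_per_comm[i]:
            trial = list(config)
            trial[i] = val
            evals += 1
            c = cost(tuple(trial))
            if c < best_cost:
                best_val, best_cost = val, c
        config[i] = best_val  # freeze this communication's setting
    return tuple(config), evals


# Hypothetical toy cost: separable, so one-at-a-time tuning reaches the optimum.
TARGETS = (3, 1, 4, 2)


def toy_cost(config):
    return sum((x - t) ** 2 for x, t in zip(config, TARGETS))


candidates = [list(range(1, 6)) for _ in TARGETS]  # k = 4, m = 5
joint_best, joint_evals = joint_search(candidates, toy_cost)
prio_best, prio_evals = priority_search(candidates, toy_cost, [0, 1, 2, 3])
print(joint_evals, prio_evals)
```

On this toy problem both searches return the same configuration because the cost is separable. In general a one-at-a-time search is a heuristic whose quality depends on the visiting order, which is presumably where a priority metric matters.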

Paper Structure (16 sections, 5 equations, 12 figures, 2 tables, 2 algorithms)

Introduction
Background and Motivation
Parallelism and Overlapping
Operators in Overlap and Parameter Tuning
Challenges
Design
Problem Definition
Contention Modeling
Priority Metric
Search Method
Evaluation
Experimental Setup
End-to-end Performance
Breakdown
Efficiency of Tuning
...and 1 more section

Figures (12)

Figure 1: Side effects from tuning a single communication. Top: baseline execution. Bottom: after tuning Comm1, increased resource contention delays Comp2.
Figure 2: Overlaps of Different Parallelism
Figure 3: Various $NC$ and $C$
Figure 4: Various $NC$
Figure 5: Various $C$
...and 7 more figures
