Improving training time and GPU utilization in geo-distributed language model training

Palak, Tella Rajashekhar Reddy, Bhaskar Kataria, Rohan Gandhi, Karan Tandon, Debopam Bhattacherjee, Venkata N. Padmanabhan

TL;DR

This work tackles the challenge of training large language models across geo-distributed data centers (DCs) linked by a WAN, where high latency and limited bandwidth create bubbles (idle GPU cycles) and low GPU utilization. It introduces two systems. Atlas is a WAN-aware scheduler that uses multiple TCP connections, cross-DC DP/PP coordination, and a DC-selection heuristic to speed up training. BubbleTea fills the remaining idle GPU time with prefill workloads from inference, boosting utilization without harming training. Together, they achieve up to 17x faster training and up to 94% GPU utilization in cross-DC LM training, validated on testbeds and in large-scale simulations; the authors state that the code will be open-sourced. The approach preserves training semantics, avoids heavy compression or quantization, and offers what-if modeling for planning DC deployments, making geo-distributed LM training more scalable and cost-effective.

Abstract

The widespread adoption of language models (LMs) has caused a huge surge in demand for GPUs. Training large LMs requires tens of thousands of GPUs, and housing them in a single datacenter (DC) is challenging due to many constraints, including the availability of peak power. We focus on training such models across multiple DCs connected via a wide-area network (WAN). We built Atlas, which reduces training time using novel workload-aware temporal bandwidth sharing and other design choices. While Atlas reduces training time, it does not completely eliminate bubbles (idle GPU cycles). We built BubbleTea, which runs prefill-as-a-service (part of LM inference) during the bubbles, thus improving GPU utilization without any impact on training. Compared to state-of-the-art designs, Atlas and BubbleTea together achieve up to 17x faster training and up to 94% GPU utilization. The code will be open-sourced.

Paper Structure

This paper contains 29 sections, 15 figures, 2 tables, and 1 algorithm.

Figures (15)

  • Figure 1: Experimental setup consisting of 6 GPUs in 3 DCs connected by WAN. WAN offers lower bandwidth compared to intra-DC bandwidth (thicker lines).
  • Figure 2: Slow-down in training time in DP as we increase the WAN latency. $y$-values are multipliers and hence unit-less.
  • Figure 3: Slow-down in training time in PP as we increase the WAN latency. $y$-values are multipliers and hence unit-less.
  • Figure 4: Execution of compute phases (forward, backward, recompute) in PP in Varuna across 6 GPUs (G-1 to G-6) in 3 DCs for L,H = 6K,8K and WAN latency of 40 msec.
  • Figure 5: Bandwidth for single and multiple TCP connections. The server is located in US-East, and we vary the location of the client (X-axis). US-SC denotes US South-central DC. The numbers over the bars denote one-way latencies.
  • ...and 10 more figures