Improving training time and GPU utilization in geo-distributed language model training
Palak, Tella Rajashekhar Reddy, Bhaskar Kataria, Rohan Gandhi, Karan Tandon, Debopam Bhattacherjee, Venkata N. Padmanabhan
TL;DR
This work tackles the challenge of training large language models across geo-distributed data centers linked by WAN, where high latency and limited bandwidth create bubbles and low GPU utilization. It introduces Atlas, a WAN-aware scheduler that uses multiple TCP connections, cross-DC DP/PP coordination, and a DC-selection heuristic to speed up training, and BubbleTea, which fills the idle GPU time with prefill workloads from inference to boost utilization without harming training. Together, they achieve up to 17x faster training and up to 94% GPU utilization in cross-DC LM training, validated by testbeds and large-scale simulations, with open-source code. The approach preserves training semantics, avoids heavy compression or quantization, and offers practical what-if modeling to plan DC deployments, making geo-distributed LM training more scalable and cost-effective.
Abstract
The widespread adoption of language models (LMs) has caused a huge surge in demand for GPUs. Training large LMs requires tens of thousands of GPUs and housing them in the same datacenter (DC) is a challenge due to many constraints including availability of peak power. We focus on training such models across multiple DCs connected via the Wide-Area-Network (WAN). We built Atlas that speeds up the training time using novel workload-aware temporal bandwidth sharing and other design choices. While Atlas improves the training time, it does not completely eliminate the bubbles (idle GPU cycles). We built BubbleTea that runs prefill-as-a-service (part of LM inference) during the bubbles thus improving the GPU utilization without any impact on training. Compared to state-of-the-art designs, Atlas and BubbleTea together achieve up to 17x faster training, and up to 94% GPU utilization. The code will be open-sourced.
