Astra: Efficient and Money-saving Automatic Parallel Strategies Search on Heterogeneous GPUs
Peiran Wang, Haibing Li, Fu Haohan, Shiyong Li, Yanpeng Wang, Dou Shen
TL;DR
Astra tackles the problem of automatically generating efficient parallel strategies for large-scale Transformer training across heterogeneous GPUs while also optimizing for monetary cost. It combines a MegatronLM-backed runtime, rapid input preprocessing, a three-stage search (space generation, filtering, cost simulation), and a money-aware selection step to produce high-throughput strategies under diverse hardware and budget constraints. Key contributions include heterogeneous-GPU strategy modeling with reduced search complexity, an XGBoost-based cost predictor, and a Pareto-like money-limited optimization that outperforms expert-tuned plans in many scenarios. The framework reduces the need for manual tuning, enabling scalable, cost-conscious deployment of large-scale distributed training in cloud environments.
Abstract
In this paper, we introduce Astra, an efficient and money-saving automatic parallel strategy search framework for heterogeneous GPUs. First, Astra searches for the efficiency-optimal parallel strategy across both the GPU configuration search space (GPU types and GPU counts) and the parallel parameter search space. Second, Astra supports heterogeneous GPUs by mathematically modeling the time consumption of heterogeneous training. Finally, Astra is the first framework to propose automatic parallel strategy search that optimizes for monetary cost. Experimental results demonstrate that Astra achieves better throughput than expert-designed strategies. On average, Astra's search time is limited to 1.27 seconds in a single-GPU setting and to under 1.35 minutes in a heterogeneous-GPU setting, with an accuracy of over 95%.
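To make the search pipeline summarized above more concrete, the following is a minimal Python sketch of a three-stage search (space generation, filtering, cost simulation) followed by a budget-constrained selection. It is an illustration under stated assumptions, not Astra's implementation: the GPU catalogue, prices, memory rule, and the analytic `simulate_step_time` function are all hypothetical, and the toy cost formula merely stands in for a learned cost predictor.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical GPU catalogue: speeds, memory sizes, and prices are made-up numbers.
GPUS = {
    "fast": {"speed": 2.0, "mem_gb": 80, "price_per_hour": 4.0},
    "slow": {"speed": 1.0, "mem_gb": 40, "price_per_hour": 1.5},
}


@dataclass(frozen=True)
class Strategy:
    gpu_type: str
    num_gpus: int
    tp: int  # tensor-parallel size
    pp: int  # pipeline-parallel size
    dp: int  # data-parallel size


def generate_space(gpu_counts, degrees):
    """Stage 1: enumerate GPU configurations crossed with parallel sizes."""
    for g, n, tp, pp, dp in product(GPUS, gpu_counts, degrees, degrees, degrees):
        yield Strategy(g, n, tp, pp, dp)


def is_feasible(s, model_mem_gb):
    """Stage 2: rule-based filtering (assumed rules, for illustration only)."""
    if s.tp * s.pp * s.dp != s.num_gpus:
        return False  # parallel sizes must exactly cover the GPUs
    # Assumption: model state is sharded evenly across tp * pp devices.
    return model_mem_gb / (s.tp * s.pp) <= GPUS[s.gpu_type]["mem_gb"]


def simulate_step_time(s, work=1000.0):
    """Stage 3: toy analytic cost model, a placeholder for a learned predictor."""
    compute = work / (GPUS[s.gpu_type]["speed"] * s.num_gpus)
    comm = 0.5 * (s.tp - 1) + 0.2 * (s.pp - 1) + 0.1 * (s.dp - 1)
    return compute + comm


def hourly_cost(s):
    return s.num_gpus * GPUS[s.gpu_type]["price_per_hour"]


def search(gpu_counts, degrees, model_mem_gb, budget_per_hour):
    """Return the fastest feasible strategy within the hourly budget, or None."""
    feasible = (s for s in generate_space(gpu_counts, degrees)
                if is_feasible(s, model_mem_gb))
    affordable = [s for s in feasible if hourly_cost(s) <= budget_per_hour]
    if not affordable:
        return None
    return min(affordable, key=simulate_step_time)


if __name__ == "__main__":
    print(search([4, 8], [1, 2, 4, 8], model_mem_gb=100, budget_per_hour=20))
    print(search([4, 8], [1, 2, 4, 8], model_mem_gb=100, budget_per_hour=12))
```

In this sketch, tightening the budget changes which GPU configuration wins, which is the kind of trade-off a money-limited search has to resolve. A fuller treatment would keep the whole throughput-versus-cost frontier rather than a single point.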
