Astra: Efficient and Money-saving Automatic Parallel Strategies Search on Heterogeneous GPUs

Peiran Wang, Haibing Li, Fu Haohan, Shiyong Li, Yanpeng Wang, Dou Shen

TL;DR

Astra tackles the problem of automatically generating efficient parallel strategies for large-scale Transformer training across heterogeneous GPUs while also optimizing for cost. It combines a MegatronLM-backed runtime, rapid input preprocessing, a three-stage search (space generation, filtering, cost simulation), and money-aware selection to produce high-throughput strategies under diverse hardware and budget constraints. Key contributions include heterogeneous-GPU strategy modeling with reduced search complexity, an XGBoost-based cost predictor, and a Pareto-like money-limited optimization that outperforms expert-tuned plans in many scenarios. The framework significantly reduces manual tuning, enabling scalable, cost-conscious deployment of large-scale distributed training in cloud environments.

Abstract

In this paper, we introduce Astra, an efficient and money-saving framework for automatic parallel strategy search on heterogeneous GPUs. First, Astra searches for the efficiency-optimal parallel strategy over both the GPU configuration search space (GPU types and GPU counts) and the parallel parameter search space. Second, Astra supports heterogeneous GPUs by mathematically modeling the time consumption of heterogeneous training. Finally, Astra is the first to propose automatic parallel strategy search that optimizes for monetary cost. Experimental results demonstrate that Astra can achieve better throughput than expert-designed strategies. On average, Astra's search time can be limited to 1.27 seconds in a single-GPU setting and to less than 1.35 minutes in a heterogeneous-GPU setting, with an accuracy of over 95%.

Paper Structure

This paper contains 30 sections, 26 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Different parallel methods have been proposed: b) tensor parallelism, c) data parallelism, d) pipeline parallelism, etc. Furthermore, hybrid parallelism, a newer paradigm that combines these methods, has been primarily applied in real-world applications.
  • Figure 2: Astra works as follows: 1) Input Preprocessing: Astra extracts MegatronLM's parameter set as its parameter search space, parses the model architecture, and generates diverse GPU configurations based on GPU type, model, and quantity. 2) Parallel Strategy Search: Using the GPU configurations, parameter set, and model architecture, Astra creates parallel strategies, which are then filtered by user-defined rules and memory constraints. The memory-based filter computes the memory of the GPUs allocated to each stage and discards a strategy if that memory exceeds the upper bound. 3) Cost Simulation: The simulator calculates communication and computation costs, using an XGBoost model to estimate each operator's time, and from these determines the total time for each strategy. 4) Money Calculation: Astra computes the monetary cost of each strategy based on its time and GPU configuration.
  • Figure 3: The time cost differs from one pipeline stage to another, and so does the bubble time, so the total duration cannot be derived from the duration of a single pipeline stage and the bubble time.
  • Figure 4: Astra's cost model is based on XGBoost. The latency of each computation and communication operator is calculated from XGBoost's predicted efficiency together with the theoretical computing power and communication bandwidth.
  • Figure 5: Comparison of the throughput of Astra's searched optimal plan with that of the expert-proposed plan in the single-GPU setting.
  • ...and 6 more figures
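
The flow described in the captions of Figures 2 and 4 (memory filtering, efficiency-based time estimation, money calculation, then selection) can be illustrated with a minimal Python sketch. It is not Astra's implementation: the `Strategy` fields, the `GPU_SPECS` numbers, the prices, and all function names are hypothetical, and the `efficiency` field is a fixed stand-in for the value that the paper obtains from its XGBoost model. Communication cost and pipeline bubbles are left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """Hypothetical candidate: a GPU configuration plus parallel degrees."""
    gpu_type: str
    num_gpus: int
    tensor_parallel: int    # descriptive only in this sketch
    pipeline_parallel: int  # descriptive only in this sketch
    peak_mem_gb: float      # assumed per-GPU peak memory estimate
    flops_per_gpu: float    # assumed compute per GPU per training step
    efficiency: float       # stand-in for a learned prediction, in (0, 1]


# Illustrative hardware numbers and prices; not taken from the paper.
GPU_SPECS = {
    "A": {"mem_gb": 80.0, "peak_flops": 300e12, "usd_per_hour": 4.0},
    "B": {"mem_gb": 32.0, "peak_flops": 100e12, "usd_per_hour": 1.0},
}


def fits_memory(s: Strategy) -> bool:
    """Memory filter: reject strategies that exceed the GPU's capacity."""
    return s.peak_mem_gb <= GPU_SPECS[s.gpu_type]["mem_gb"]


def step_time_s(s: Strategy) -> float:
    """Time per step = work / (theoretical peak * predicted efficiency)."""
    achieved = GPU_SPECS[s.gpu_type]["peak_flops"] * s.efficiency
    return s.flops_per_gpu / achieved


def money_usd(s: Strategy, num_steps: int) -> float:
    """Monetary cost = training hours * GPU count * hourly price."""
    hours = step_time_s(s) * num_steps / 3600.0
    return hours * s.num_gpus * GPU_SPECS[s.gpu_type]["usd_per_hour"]


def search(candidates, num_steps, budget_usd=None):
    """Return the fastest feasible strategy, optionally under a budget."""
    feasible = [s for s in candidates if fits_memory(s)]
    if budget_usd is not None:
        feasible = [s for s in feasible
                    if money_usd(s, num_steps) <= budget_usd]
    return min(feasible, key=step_time_s, default=None)


candidates = [
    Strategy("A", 8, 8, 1, 60.0, 1.5e14, 0.5),   # fast but expensive
    Strategy("B", 8, 8, 1, 40.0, 1.5e14, 0.5),   # does not fit in memory
    Strategy("B", 16, 4, 4, 24.0, 7.5e13, 0.5),  # slower but cheaper
]

fastest = search(candidates, num_steps=3600)
cheapest_in_budget = search(candidates, num_steps=3600, budget_usd=30.0)
```

With these made-up numbers, the unconstrained search picks the first candidate, while a 30 USD budget excludes it and leaves the third. This is the kind of time-versus-money trade-off that a money-limited search has to resolve.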