Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters

WenZheng Zhang, Yang Hu, Jing Shi, Xiaoying Bai

TL;DR

Poplar presents a heterogeneous-aware extension of ZeRO for distributed DNN training, combining online GPU profiling with offline batch-size optimization to automatically allocate work across nonuniform GPUs. It builds per-GPU performance curves via cubic spline interpolation and uses a search-based mechanism to assign micro-batches that minimize idle time and communication overhead. Empirical results on three real heterogeneous clusters show throughput improvements up to $3.92\times$ over baselines like DeepSpeed and Whale, with substantial gains in ZeRO-2/3 by reducing gradient accumulation and improving load balance. The work demonstrates practical scalability of large-scale DNN training in realistic mixed-GPU environments and offers automation that reduces manual tuning and expert intervention.

Abstract

Scaling Deep Neural Networks (DNNs) requires significant computational resources, both in the number of GPUs and in their compute capacity. In practice, clusters often contain many heterogeneous GPU devices because of the rapid release cycle of GPU products. Harnessing heterogeneous GPUs efficiently and economically is therefore essential to meeting the needs of DNN research and development. This paper introduces Poplar, a distributed training system that extends the Zero Redundancy Optimizer (ZeRO) with heterogeneity-aware capabilities. We explore a broader spectrum of GPU heterogeneity, including compute capability, memory capacity, quantity, and combinations of these. To achieve high computational efficiency across all heterogeneous conditions, Poplar conducts fine-grained measurements of GPUs in each ZeRO stage. We propose a novel batch allocation method and a search algorithm to optimize the utilization of heterogeneous GPU clusters. Furthermore, Poplar implements fully automated parallelism, eliminating the need for manual effort in deploying on heterogeneous hardware and in finding a suitable batch size. Extensive experiments on three heterogeneous clusters, comprising six different types of GPUs, demonstrate that Poplar achieves a training throughput improvement of 1.02-3.92x over current state-of-the-art heterogeneous training systems.
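The batch allocation idea can be illustrated with a minimal sketch. This is not Poplar's implementation: it uses piecewise-linear interpolation over profiled points in place of the cubic splines mentioned in the TL;DR, and a simple greedy search in place of the paper's search algorithm. The GPU names (`fast`, `slow`), the profiling numbers, and the helper functions `step_time`, `allocate`, and `idle_time` are all hypothetical and chosen only for illustration.

```python
def step_time(samples, b):
    """Piecewise-linear estimate of the time (ms) to process b samples.

    samples: list of (batch_size, milliseconds) profiling points for one GPU.
    A (0, 0.0) origin point is assumed so small batches can be interpolated.
    """
    if b == 0:
        return 0.0
    pts = [(0, 0.0)] + sorted(samples)
    for (x0, t0), (x1, t1) in zip(pts, pts[1:]):
        if x0 <= b <= x1:
            return t0 + (t1 - t0) * (b - x0) / (x1 - x0)
    raise ValueError("batch size exceeds the largest profiled size")


def allocate(profiles, global_batch):
    """Greedy search: hand out samples one at a time to the GPU that
    would finish earliest after receiving the extra sample."""
    alloc = {gpu: 0 for gpu in profiles}
    # Assumption: the largest profiled batch is the memory limit of that GPU.
    limit = {gpu: max(x for x, _ in s) for gpu, s in profiles.items()}
    for _ in range(global_batch):
        candidates = [g for g in profiles if alloc[g] < limit[g]]
        if not candidates:
            raise ValueError("global batch does not fit on the cluster")
        best = min(candidates,
                   key=lambda g: step_time(profiles[g], alloc[g] + 1))
        alloc[best] += 1
    return alloc


def idle_time(profiles, alloc):
    """Time (ms) each GPU waits for the slowest one before synchronization."""
    times = {g: step_time(profiles[g], b) for g, b in alloc.items()}
    slowest = max(times.values())
    return {g: slowest - t for g, t in times.items()}


# Hypothetical profiles: "fast" takes 10 ms per sample, "slow" takes 30 ms.
profiles = {
    "fast": [(8, 80), (16, 160), (32, 320)],
    "slow": [(8, 240), (16, 480)],
}

uniform = {"fast": 16, "slow": 16}
balanced = allocate(profiles, 32)

print("uniform idle (ms): ", idle_time(profiles, uniform))
print("balanced split:    ", balanced)
print("balanced idle (ms):", idle_time(profiles, balanced))
```

With these made-up numbers, an even split leaves the faster GPU waiting while the slower one finishes, whereas the greedy allocation gives the faster GPU three times as many samples so both finish together. A real system would also need to account for communication cost and ZeRO-stage-specific memory use, which this sketch ignores.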

Paper Structure (27 sections, 12 equations, 8 figures, 2 tables, 2 algorithms)

Figures (8)

  • Figure 1: High-end GPUs complete their tasks first and then start waiting before synchronization. Without load balancing, there will be significant idle time.
  • Figure 2: An overview of how Poplar automatically determines the optimal configuration.
  • Figure 3: The performance on three types of heterogeneous environments. Poplar performs better than all baselines.
  • Figure 4: Results on different models. Poplar performs better on BERT than on Llama. Because of the small micro-batch size, Poplar performs less well at 1.1B parameters than at 0.5B parameters.
  • Figure 5: The evaluation on Poplar's training capabilities across varying numbers of heterogeneous GPUs. The numbers and letters in the figure indicate the quantity of corresponding GPUs, for example, V4 denotes four V100S, A4 denotes four A800, and V4A1 denotes four V100S with one A800.
  • ...and 3 more figures