DeepCEE: Efficient Cross-Region Model Distributed Training System under Heterogeneous GPUs and Networks

Jinquan Wang, Xiaojian Liao, Xuzhao Liu, Jiashun Suo, Zhisheng Huo, Chenhao Zhang, Xiangrong Xu, Runnan Shen, Xilong Xie, Limin Xiao

TL;DR

This work tackles the challenge of efficient cross-region training in cloud-edge-end (CEE) environments, where hierarchical network topology and fluctuating bandwidth hinder traditional parallelism. It introduces DeepCEE, a network-centric geo-distributed training system that automatically derives asymmetric parallel strategies through a Heterogeneous Devices Profiler, a Parallel Strategy Planner, and a Dynamic Environment Adapter. By forming two-level device groups, employing compact zero-bubble pipeline parallelism, and dynamically adjusting micro-batches during network fluctuations, DeepCEE achieves substantial throughput improvements (1.3–2.8× over SOTA) and robust adaptation under changing network conditions. The findings demonstrate the practical potential of exploiting idle edge GPUs and heterogeneous networks for scalable, efficient distributed training in real-world multi-region deployments.

Abstract

Most existing training systems focus on a single region. In contrast, we envision that cross-region training offers more flexible GPU resource allocation and yields significant potential. However, the hierarchical cluster topology and unstable networks in the cloud-edge-end (CEE) environment, a typical cross-region scenario, pose substantial challenges to building an efficient and autonomous model training system. We propose DeepCEE, a geo-distributed model training system tailored for heterogeneous GPUs and networks in CEE environments. DeepCEE adopts a communication-centric design philosophy to tackle challenges arising from slow and unstable inter-region networks. It begins with a heterogeneous device profiler that identifies and groups devices based on both network and compute characteristics. Leveraging device groups, DeepCEE implements compact, zero-bubble pipeline parallelism, automatically deriving optimal parallel strategies. To further adapt to runtime variability, DeepCEE integrates a dynamic environment adapter that reacts to network fluctuations. Extensive evaluations demonstrate that DeepCEE achieves 1.3–2.8× higher training throughput compared to widely used and SOTA training systems.
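
The two-level grouping idea mentioned above (devices grouped first by network characteristics, then by compute characteristics) can be illustrated with a minimal Python sketch. This is not DeepCEE's profiler: the `Device` fields, the use of a region label as a stand-in for network locality, the use of a GPU model name as a stand-in for compute capability, and the function name `two_level_groups` are all assumptions made for illustration. A real profiler would presumably rely on measured bandwidth and compute throughput rather than labels.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    region: str     # assumed proxy for network locality (first-level group key)
    gpu_model: str  # assumed proxy for compute capability (second-level group key)


def two_level_groups(devices):
    """Group devices by region first, then by GPU model inside each region."""
    groups = defaultdict(lambda: defaultdict(list))
    for dev in devices:
        groups[dev.region][dev.gpu_model].append(dev.name)
    # Convert to plain dicts so the result is easy to inspect and compare.
    return {region: dict(by_gpu) for region, by_gpu in groups.items()}


# Hypothetical inventory: the device names, regions, and GPU models are made up.
devices = [
    Device("gpu0", "cloud", "A100"),
    Device("gpu1", "cloud", "A100"),
    Device("gpu2", "edge-1", "RTX3090"),
    Device("gpu3", "edge-1", "T4"),
]
grouping = two_level_groups(devices)
```

Under these assumptions, each outer key plays the role of a first-level (network) group and each inner key plays the role of a second-level (compute) group, so a planner could treat slow inter-region links and mismatched GPUs as separate concerns.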

Paper Structure

This paper contains 23 sections, 1 equation, 13 figures, 2 tables, 1 algorithm.

Figures (13)

  • Figure 1: The CEE environment and its workload status.
  • Figure 2: Performance comparison of different methods in a real CEE environment.
  • Figure 3: In the CEE environment, current distributed training strategies face two challenges: the hierarchical cluster topology fundamentally constrains training throughput, while network fluctuations frequently trigger abrupt performance degradation.
  • Figure 4: The main working components and workflow of DeepCEE. DeepCEE includes the pre-run performance evaluation component Heterogeneous Devices Profiler, the pre-run parallel planning component Parallel Strategy Planner, and the runtime environment adaptation component Dynamic Environment Adapter.
  • Figure 5: An example of heterogeneous device and network grouping in the CEE environment. (FG: first-level network device group; SG: second-level computing device group)
  • ...and 8 more figures