AntDT: A Self-Adaptive Distributed Training Framework for Leader and Straggler Nodes

Youshao Xiao, Lin Ju, Zhenglei Zhou, Siyuan Li, Zhaoxin Huan, Dalong Zhang, Rujie Jiang, Lin Wang, Xiaolu Zhang, Lei Liang, Jun Zhou

TL;DR

AntDT introduces a unified, self-adaptive framework for mitigating leader and straggler nodes in data-parallel distributed training. By decoupling data allocation and fault tolerance from mitigation strategies through Stateful Dynamic Data Sharding, Monitor, Controller, and Agent, it enables flexible, pre-defined or custom straggler mitigation actions. Two concrete solutions, AntDT-ND for non-dedicated and AntDT-DD for dedicated clusters, demonstrate on industrial workloads that AntDT can achieve multi-fold training speedups and substantial reductions in production training times, while preserving data integrity and scalability. The framework shows strong hardware-heterogeneity resilience, low overhead, and practical deployment viability, with plans to release open-source code.

Abstract

Many distributed training techniques, such as Parameter Server and AllReduce, have been proposed to take advantage of increasingly large data and rich features. However, stragglers frequently occur in distributed training due to resource contention and hardware heterogeneity, significantly hampering training efficiency. Previous work addresses only some types of stragglers and cannot adaptively handle the variety of stragglers seen in practice. Additionally, it is challenging to build a systematic framework that addresses all stragglers, because different stragglers require different data allocation and fault-tolerance mechanisms. Therefore, this paper proposes a unified distributed training framework called AntDT (Ant Distributed Training Framework) to adaptively solve straggler problems. First, the framework consists of four components: the Stateful Dynamic Data Sharding service, Monitor, Controller, and Agent. These components work together to distribute workloads efficiently and provide a range of pre-defined straggler mitigation methods with fault tolerance, thereby hiding the messy details of data allocation and fault handling. Second, the framework provides a high degree of flexibility, allowing straggler mitigation solutions to be customized to the specific circumstances of the cluster. Leveraging this flexibility, we introduce two straggler mitigation solutions, AntDT-ND for non-dedicated clusters and AntDT-DD for dedicated clusters, as practical examples that resolve various types of stragglers at Ant Group. As shown by our comprehensive experiments and industrial deployment statistics, AntDT outperforms other state-of-the-art (SOTA) methods by more than 3x in terms of training efficiency. Additionally, in Alipay's homepage recommendation scenario, using AntDT reduces the training duration of the ranking model from 27.8 hours to just 5.4 hours.
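
The abstract does not show the interface of the Stateful Dynamic Data Sharding service, so the following is only a minimal sketch of the general idea, under stated assumptions: workers pull small shards on demand instead of receiving a fixed partition, and the service tracks each shard's state so that work held by a failed worker can be requeued. All names here (ShardService, fetch, report_done, recover, simulate) are hypothetical and are not AntDT's API.

```python
import heapq
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ShardService:
    """Hypothetical stateful shard queue (illustrative, not AntDT's API)."""
    num_samples: int
    shard_size: int
    todo: deque = field(default_factory=deque)   # shards not yet assigned
    doing: dict = field(default_factory=dict)    # shard -> worker holding it
    done: list = field(default_factory=list)     # shards fully consumed

    def __post_init__(self):
        # Cut the sample index range into fixed-size (start, end) shards.
        for start in range(0, self.num_samples, self.shard_size):
            end = min(start + self.shard_size, self.num_samples)
            self.todo.append((start, end))

    def fetch(self, worker):
        """Hand the next unassigned shard to a worker, or None if none remain."""
        if not self.todo:
            return None
        shard = self.todo.popleft()
        self.doing[shard] = worker
        return shard

    def report_done(self, shard):
        """Mark a shard as consumed."""
        del self.doing[shard]
        self.done.append(shard)

    def recover(self, worker):
        """Requeue every shard held by a failed worker so no data is lost."""
        lost = [s for s, w in self.doing.items() if w == worker]
        for shard in lost:
            del self.doing[shard]
            self.todo.appendleft(shard)
        return lost


def simulate(service, speeds):
    """Toy simulation: each worker fetches a new shard as soon as it is free.

    `speeds` maps worker name -> samples processed per time unit (assumed).
    Returns the number of samples each worker consumed.
    """
    consumed = {w: 0 for w in speeds}
    ready = [(0.0, w) for w in sorted(speeds)]   # (time worker becomes free, worker)
    heapq.heapify(ready)
    while ready:
        now, worker = heapq.heappop(ready)
        shard = service.fetch(worker)
        if shard is None:
            continue                              # no work left; worker retires
        start, end = shard
        service.report_done(shard)
        consumed[worker] += end - start
        heapq.heappush(ready, (now + (end - start) / speeds[worker], worker))
    return consumed


if __name__ == "__main__":
    service = ShardService(num_samples=1000, shard_size=100)
    print(simulate(service, {"fast": 4.0, "slow": 1.0}))
```

In this toy setting the faster worker ends up consuming more shards than the slower one, so a straggler delays the job less than it would under a fixed, equal partition. The recover method illustrates why shard state matters for fault tolerance: shards that were assigned but never reported done return to the queue.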

Paper Structure (41 sections, 2 equations, 18 figures, 3 tables)

Figures (18)

  • Figure 1: BPT(s) among six workers.
  • Figure 2: BPT(s) among four servers.
  • Figure 4: Job completion time (JCT) between BSP and ASP in dedicated and non-dedicated CPU clusters using the XDeepFM model (Lian et al., 2018).
  • Figure 5: Data consumption and local throughput among workers in ASP of Parameter Server in the non-dedicated CPU cluster.
  • Figure 6: Overview of AntDT Framework
  • ...and 13 more figures