AllReduce Scheduling with Hierarchical Deep Reinforcement Learning
Yufan Wei, Mickel Liu, Wenfei Wu
TL;DR
This work addresses inflexible AllReduce scheduling in distributed ML by proposing a DRL-based pipeline that generalizes across network topologies. It introduces two hierarchically structured POMDPs forming a hierarchical RL framework: a high-level Flow-Tree Selection agent guides the lower-level Workload Scheduling agent to produce valid, efficient schedules. A flow-level simulator with topology generation for BCube, DCell, and Jellyfish, along with a workload-tree and a merge operation, enables end-to-end evaluation and optimization of scheduling decisions. Empirical results show the RL-based approach outperforms traditional Parameter Server and Ring AllReduce on several topologies, highlighting the value of topology-aware, learned scheduling and the potential for broader applicability in DCN flow optimization.
Abstract
AllReduce is a technique in distributed computing that is used in many critical deep learning applications. Existing AllReduce scheduling methods often lack flexibility because they are topology-specific or rely on extensive handcrafted designs that require domain-specific knowledge. In this work, we aim to alleviate this inflexibility by proposing a deep-reinforcement-learning (DRL)-based pipeline that can generate AllReduce schedules for various network topologies without topology-specific design features. The flow scheduling module of this pipeline consists of two hierarchically structured DRL policies that work cooperatively to find optimal schedules. We compare the performance of our method against baseline methods on three topologies: BCube, DCell, and Jellyfish. Finally, we contribute a Python-based environment that simulates AllReduce scheduling on these network topologies.
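For readers unfamiliar with the operation being scheduled, the following is a minimal sketch of the textbook Ring AllReduce pattern (a reduce-scatter phase followed by an all-gather phase), which is one of the baselines named above. It is an illustration only, not the authors' simulator or scheduler: the function name `ring_allreduce`, the one-scalar-per-chunk data layout, and the absence of any link or bandwidth model are all assumptions made here for brevity.

```python
def ring_allreduce(data):
    """Simulate Ring AllReduce (sum) over n workers arranged in a ring.

    `data` is a list of n lists, each of length n: data[i][c] is worker i's
    local value for chunk c. Returns the per-worker buffers after the
    operation; every worker ends up holding the element-wise sum.
    Hypothetical helper for illustration, not part of the paper's code.
    """
    n = len(data)
    buf = [list(row) for row in data]  # copy so the input is not mutated

    # Phase 1: reduce-scatter. In step s, worker i sends chunk (i - s) mod n
    # to its ring successor, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        # Snapshot all outgoing messages first so sends within a step
        # behave as if they were simultaneous.
        sends = [(i, (i - step) % n, buf[i][(i - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            buf[(src + 1) % n][chunk] += value

    # After phase 1, worker i holds the fully reduced chunk (i + 1) mod n.
    # Phase 2: all-gather. In step s, worker i forwards chunk (i + 1 - s) mod n
    # to its successor, which overwrites its stale copy.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, buf[i][(i + 1 - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            buf[(src + 1) % n][chunk] = value

    return buf


if __name__ == "__main__":
    workers = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    for row in ring_allreduce(workers):
        print(row)
```

The communication pattern in this sketch is fixed in advance and ignores the physical topology, which is the kind of inflexibility the learned, topology-aware scheduler described above is meant to address.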
