AllReduce Scheduling with Hierarchical Deep Reinforcement Learning

Yufan Wei, Mickel Liu, Wenfei Wu

TL;DR

This work addresses inflexible AllReduce scheduling in distributed ML by proposing a DRL-based pipeline that generalizes across network topologies. It introduces two hierarchically structured POMDPs forming a hierarchical RL framework: a high-level Flow-Tree Selection agent guides the lower-level Workload Scheduling agent to produce valid, efficient schedules. A flow-level simulator with topology generation for BCube, DCell, and Jellyfish, along with a workload-tree and a merge operation, enables end-to-end evaluation and optimization of scheduling decisions. Empirical results show the RL-based approach outperforms traditional Parameter Server and Ring AllReduce on several topologies, highlighting the value of topology-aware, learned scheduling and the potential for broader applicability in DCN flow optimization.

Abstract

AllReduce is a technique in distributed computing that is used in many critical applications of deep learning. Existing AllReduce scheduling methods often lack flexibility because they are topology-specific or rely on extensive handcrafted designs that require domain-specific knowledge. In this work, we aim to alleviate this inflexibility by proposing a deep-reinforcement-learning (DRL)-based pipeline that can generate AllReduce schedules for various network topologies without topology-specific design features. The flow scheduling module of this pipeline consists of two hierarchically structured DRL policies that work cooperatively to find an optimal schedule. We showcase the performance of our method against baseline methods on three topologies: BCube, DCell, and Jellyfish. Finally, we contribute a Python-based environment that simulates AllReduce scheduling on these network topologies.
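For readers unfamiliar with the operation, the following is a minimal sketch of what AllReduce computes, using the classic Ring AllReduce pattern (one of the baselines this work compares against). It is not the paper's simulator or its learned scheduler: the function name `ring_allreduce`, the in-memory "workers", and the one-chunk-per-worker layout are illustrative assumptions, and network topology and link contention are ignored entirely.

```python
from typing import List


def ring_allreduce(data: List[List[float]]) -> List[List[float]]:
    """Simulate a sum-AllReduce over a logical ring of n workers.

    data[i][c] is chunk c held by worker i (hypothetical layout: each
    worker's vector is split into exactly n chunks). Returns the buffers
    after the operation; every worker ends with the element-wise sum.
    """
    n = len(data)
    if any(len(row) != n for row in data):
        raise ValueError("each worker must hold exactly n chunks")
    buf = [row[:] for row in data]  # copy so the input is left untouched

    # Phase 1, reduce-scatter: after n-1 steps worker i holds the fully
    # reduced chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot all sends first so transfers within a step are simultaneous.
        sends = [((i - step) % n, buf[i][(i - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            buf[(i + 1) % n][chunk] += value

    # Phase 2, all-gather: circulate the reduced chunks around the ring.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, buf[i][(i + 1 - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            buf[(i + 1) % n][chunk] = value

    return buf


if __name__ == "__main__":
    workers = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(ring_allreduce(workers))
```

Each worker sends and receives one chunk per step over 2(n-1) steps, which is why the ring pattern is bandwidth-efficient on a ring-like network but can be a poor fit for other topologies, the inflexibility that motivates a learned scheduler.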

Paper Structure

This paper contains 28 sections, 5 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: An overview of our method. The objective is to perform the AllReduce operation efficiently with the aid of RL-based decision-making models.
  • Figure 2: A simplified illustration of our proposed pipeline. We use the flow-tree representation to describe the workloads, or flow demand, of a server node. At each round, the upper-level model selects a set of flow-trees, which defines the pool of candidates from which the lower-level model schedules flows. The lower-level model then determines a valid flow schedule and sends out the flows, advancing the environment to the next round.
  • Figure 3: An example of the Parameter Server and the merge operation.