DOPPLER: Dual-Policy Learning for Device Assignment in Asynchronous Dataflow Graphs

Xinyu Yao, Daniel Bourgeois, Abhinav Jain, Yuxin Tang, Jiawen Yao, Zhimin Ding, Arlei Silva, Chris Jermaine

TL;DR

Doppler tackles the problem of assigning dataflow graph operations to GPUs under a work-conserving scheduler to minimize execution time in multi-GPU systems. It introduces a dual-policy framework with SEL and PLC networks and a three-stage training pipeline—imitation learning, simulation-based RL, and real-system RL—to enable fast convergence and online refinement. Empirical results show substantial runtime reductions across diverse graph types and hardware, with strong transferability and improved load balancing through learned node traversal and placement. The approach reduces data transfers and improves GPU utilization, offering a practical path for efficient deployment of ML workloads on scalable GPU platforms.

Abstract

We study the problem of assigning operations in a dataflow graph to devices to minimize execution time in a work-conserving system, with emphasis on complex machine learning workloads. Prior learning-based methods often struggle due to three key limitations: (1) reliance on bulk-synchronous systems like TensorFlow, which under-utilize devices due to barrier synchronization; (2) lack of awareness of the scheduling mechanism of underlying systems when designing learning-based methods; and (3) exclusive dependence on reinforcement learning, ignoring the structure of effective heuristics designed by experts. In this paper, we propose Doppler, a three-stage framework for training dual-policy networks consisting of 1) a $\mathsf{SEL}$ policy for selecting operations and 2) a $\mathsf{PLC}$ policy for placing chosen operations on devices. Our experiments show that Doppler outperforms all baseline methods across tasks by reducing system execution time and additionally demonstrates sampling efficiency by reducing per-episode training time.
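
The dual-policy idea in the abstract (one policy picks the next operation, a second policy picks its device) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the names `assign`, `sel_first_ready`, and `plc_least_loaded` are hypothetical, the restriction of candidates to nodes whose predecessors are already placed is assumed, and the two hand-written heuristics merely stand in for the learned $\mathsf{SEL}$ and $\mathsf{PLC}$ networks.

```python
from typing import Callable, Dict, List

Graph = Dict[str, List[str]]   # node -> list of predecessor nodes (assumed acyclic)
Assignment = Dict[str, int]    # node -> device index


def assign(graph: Graph,
           sel: Callable[[List[str], Assignment], str],
           plc: Callable[[str, Assignment, int], int],
           num_devices: int) -> Assignment:
    """Alternate between a selection policy and a placement policy
    until every node in the graph has a device."""
    assignment: Assignment = {}
    while len(assignment) < len(graph):
        # Assumption: candidates are unassigned nodes whose predecessors are all placed.
        ready = [n for n, preds in graph.items()
                 if n not in assignment and all(p in assignment for p in preds)]
        if not ready:
            raise ValueError("graph has a cycle or references a missing node")
        node = sel(ready, assignment)                           # stand-in for SEL
        assignment[node] = plc(node, assignment, num_devices)   # stand-in for PLC
    return assignment


def sel_first_ready(ready: List[str], assignment: Assignment) -> str:
    """Hypothetical selection heuristic: alphabetically first ready node."""
    return min(ready)


def plc_least_loaded(node: str, assignment: Assignment, num_devices: int) -> int:
    """Hypothetical placement heuristic: device holding the fewest nodes so far."""
    loads = [0] * num_devices
    for device in assignment.values():
        loads[device] += 1
    return loads.index(min(loads))


# Toy graph: "c" consumes "a" and "b"; "d" consumes "c".
graph: Graph = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
result = assign(graph, sel_first_ready, plc_least_loaded, num_devices=2)
print(result)
```

Passing the two policies as plain callables keeps the loop independent of how they are implemented, so a heuristic can later be swapped for a trained model without touching the assignment loop.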

Paper Structure

This paper contains 42 sections, 7 equations, 21 figures, 9 tables, 4 algorithms.

Figures (21)

  • Figure 1: Execution time (in milliseconds) in a work-conserving (WC) system and a synchronous system. Configuration details can be found in Appendix \ref{synchrnous-system-runtime-details}.
  • Figure 1: (a) A decomposition of a matrix multiplication chain. (b) A dataflow graph corresponding to the decomposed chain from (a). Colors show the mapping of computations to GPUs.
  • Figure 2: Assign$(\mathsf{SEL}_{\theta}, \mathsf{PLC}_{\theta})$
  • Figure 3: Real engine execution times (in milliseconds) for Doppler-sys using different combinations of the three training stages for the Llama-layer dataflow graph.
  • Figure 4: Assignments for Ffnn found by Doppler and Placeto. Colors show the mapping of computations to GPUs. Doppler is more effective at load balancing across GPUs.
  • ...and 16 more figures