Data-aware Dynamic Execution of Irregular Workloads on Heterogeneous Systems

Zhenyu Bai, Dan Wu, Pranav Dangi, Dhananjaya Wijerathne, Venkata Pavan Kumar Miriyala, Tulika Mitra

TL;DR

This work addresses dynamic workload management for irregular workloads on heterogeneous systems with specialized accelerators (GPUs and FPGAs). It introduces DyPe, a data-aware, DP-based scheduling framework that jointly optimizes throughput and energy by partitioning, deploying, and rescheduling kernels across devices while accounting for inter-device data transfers. Key contributions include a multi-objective design-space navigator, an accurate kernel performance model, and a proof-of-concept FPGA-GPU P2P system. On that system, DyPe achieves on average $1.53\times$ throughput and $1.09\times$ energy efficiency over the static baseline, and $1.44\times$ throughput and $1.66\times$ energy efficiency over the GPU-only baseline. DyPe performs robustly across GNN and sliding-window transformer workloads, enabling effective energy-performance trade-offs on heterogeneous hardware for sparse and irregular computations.

Abstract

Current approaches to scheduling workloads on heterogeneous systems with specialized accelerators often rely on manual partitioning, offloading tasks with specific compute patterns to accelerators. This method requires extensive experimentation and human effort to identify the tasks suitable for the accelerator. To solve this problem, we introduce DyPe, a scheduling framework tailored for heterogeneous systems with specialized accelerators. Our method automatically partitions, deploys, and reschedules execution when necessary by dynamically analyzing the characteristics of the input data and leveraging the inter-operator parallelism among heterogeneous devices. DyPe navigates a multi-objective, multi-constraint design space that considers both system constraints and application requirements, which allows it to discover Pareto-optimal mapping configurations, improving the system's overall performance and effectively managing energy-performance trade-offs. To demonstrate the benefits of our approach on real hardware, we build a heterogeneous system of GPUs and FPGAs with peer-to-peer data transfers. The experiments show that conventional static scheduling is optimal in only 13 out of 86 cases across different workloads and system settings, while DyPe adapts and finds the optimal schedule in 77 out of 86 cases, with an average loss of only 3.95% in performance or energy efficiency in the sub-optimal cases. Performance evaluation of DyPe shows an average of 1.53x throughput and 1.09x energy efficiency improvement over the static schedule baseline and 1.44x throughput and 1.66x energy efficiency over the GPU-only baseline.
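
To make the notion of Pareto-optimal mapping configurations concrete, the following is a minimal sketch of filtering candidate mappings by throughput and energy efficiency. It is not DyPe's navigator: the `Config` type, the configuration names, and all numbers are hypothetical values chosen for illustration and are not measurements from the paper.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """A hypothetical mapping configuration (illustrative only)."""
    name: str
    throughput: float         # higher is better
    energy_efficiency: float  # higher is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is at least as good as b on both objectives and strictly better on one."""
    no_worse = (a.throughput >= b.throughput
                and a.energy_efficiency >= b.energy_efficiency)
    strictly_better = (a.throughput > b.throughput
                       or a.energy_efficiency > b.energy_efficiency)
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs
            if not any(dominates(other, c) for other in configs)]


# Made-up candidates, normalized to an assumed GPU-only reference point.
candidates = [
    Config("gpu_only", 1.00, 1.00),
    Config("static_split", 0.95, 1.50),
    Config("dynamic_split", 1.40, 1.60),
    Config("fpga_heavy", 0.80, 1.90),
]

front = pareto_front(candidates)
for cfg in front:
    print(cfg.name, cfg.throughput, cfg.energy_efficiency)
```

With these made-up numbers, `dynamic_split` and `fpga_heavy` remain on the front: neither beats the other on both objectives, so choosing between them is an energy-performance trade-off that depends on the application's requirements.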

Paper Structure

This paper contains 20 sections, 8 equations, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: Different parallelism patterns.
  • Figure 2: Example pipelined schedules for GCN inference. (a) Two-stage example, (b) the same schedule with higher sparsity in SpMM1, and (c) the new optimal schedule considering the sparsity change.
  • Figure 3: The DyPe framework.
  • Figure 4: Example two-stage pipeline with and without conflict.
  • Figure 5: System hardware.
  • ...and 4 more figures