Data-aware Dynamic Execution of Irregular Workloads on Heterogeneous Systems

Zhenyu Bai; Dan Wu; Pranav Dangi; Dhananjaya Wijerathne; Venkata Pavan Kumar Miriyala; Tulika Mitra

TL;DR

This work addresses dynamic management of workloads with irregular data patterns on heterogeneous systems that contain specialized accelerators (GPUs and FPGAs). It introduces DyPe, a data-aware, DP-based scheduling framework that simultaneously optimizes throughput and energy by partitioning, deploying, and rescheduling kernels across devices while accounting for inter-device data transfers. Key contributions include a multi-objective design-space navigator, an accurate kernel performance model, and a proof-of-concept FPGA-GPU P2P system validating substantial improvements: on average $1.53\times$ throughput and $1.09\times$ energy efficiency over the static baseline, and $1.44\times$ throughput and $1.66\times$ energy efficiency over the GPU-only baseline. DyPe demonstrates robust performance across GNN and sliding-window transformer workloads, enabling effective energy-performance trade-offs in heterogeneous hardware for sparse and irregular computations.

Abstract

Current approaches to scheduling workloads on heterogeneous systems with specialized accelerators often rely on manual partitioning, offloading tasks with specific compute patterns to accelerators. This method requires extensive experimentation and human effort to identify the tasks suitable for the accelerator. To solve this problem, we introduce DyPe, a scheduling framework tailored for heterogeneous systems with specialized accelerators. Our method automatically partitions, deploys, and reschedules execution when necessary by dynamically analyzing the characteristics of the input data and leveraging the inter-operator parallelism among heterogeneous devices. DyPe navigates a multi-objective, multi-constraint design space that considers both system constraints and application requirements, which allows it to discover Pareto-optimal mapping configurations, improving the system's overall performance and effectively managing energy-performance trade-offs. To demonstrate the benefits of our approach on real hardware, we build a heterogeneous system of GPUs and FPGAs with peer-to-peer data transfers. The experiments show that conventional static scheduling is optimal in only 13 out of 86 cases across different workloads and system settings, while DyPe adapts and finds the optimal schedule in 77 out of 86 cases, with an average loss of only 3.95% in performance or energy efficiency in the sub-optimal cases. Performance evaluation of DyPe shows an average of 1.53x throughput and 1.09x energy efficiency improvement over the static schedule baseline and 1.44x throughput and 1.66x energy efficiency improvement over the GPU-only baseline.
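
To make the notion of a Pareto-optimal mapping configuration concrete, the following is a minimal sketch of Pareto filtering over candidate mappings scored on two objectives, throughput and energy efficiency. It is not DyPe's navigator: the `Mapping` type, the candidate names, and all numbers are hypothetical and chosen only for illustration, and both objectives are assumed to be "higher is better".

```python
from typing import List, NamedTuple


class Mapping(NamedTuple):
    """A hypothetical candidate mapping with two objectives (higher is better)."""
    name: str
    throughput: float
    energy_eff: float


def dominates(a: Mapping, b: Mapping) -> bool:
    """True if `a` is at least as good as `b` on both objectives and strictly better on one."""
    no_worse = a.throughput >= b.throughput and a.energy_eff >= b.energy_eff
    strictly_better = a.throughput > b.throughput or a.energy_eff > b.energy_eff
    return no_worse and strictly_better


def pareto_front(candidates: List[Mapping]) -> List[Mapping]:
    """Keep every candidate that no other candidate dominates (input order preserved)."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Illustrative, made-up scores; these are not measurements from the paper.
candidates = [
    Mapping("gpu_only", 1.00, 1.00),
    Mapping("static_split", 0.95, 1.50),
    Mapping("dynamic_a", 1.40, 1.60),
    Mapping("dynamic_b", 1.55, 1.20),
    Mapping("dynamic_c", 1.30, 1.10),
]

front = pareto_front(candidates)
for m in front:
    print(m.name, m.throughput, m.energy_eff)
```

In this toy set, `dynamic_a` and `dynamic_b` remain on the front because each is better than the other on one objective. Choosing between them is the energy-performance trade-off that a scheduler would resolve using application requirements.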
