Evaluating Rapid Makespan Predictions for Heterogeneous Systems with Programmable Logic

Martin Wilhelm, Franz Freitag, Max Tzschoppe, Thilo Pionteck

TL;DR

This work tackles rapid makespan prediction for task mapping in heterogeneous systems comprising CPUs, GPUs, and FPGAs. It introduces an OpenCL-based evaluation framework that generates large numbers of random, annotated task graphs and corresponding kernels, enabling quick prediction-versus-real-run validation without full hardware-specific implementations. The study analyzes the accuracy and practicality of existing analytical approaches, highlighting challenges from data transfer, streaming, and device congestion, and demonstrates that rapid predictions can effectively guide design-space exploration, even when FPGA bitstream generation remains a bottleneck. The framework is publicly available and aims to bridge theory and practice by enabling developers to refine makespan prediction algorithms for complex, dataflow-capable accelerators.

Abstract

Heterogeneous computing systems, which combine general-purpose processors with specialized accelerators, are increasingly important for optimizing the performance of modern applications. A central challenge is to decide which parts of an application should be executed on which accelerator or, more generally, how to map the tasks of an application to available devices. Predicting the impact of a change in a task mapping on the overall makespan is non-trivial. While there are very capable simulators, these generally require a full implementation of the tasks in question, which is particularly time-intensive for programmable logic. A promising alternative is to use a purely analytical function, which allows for very fast predictions, but abstracts significantly from reality. Bridging the gap between theory and practice poses a significant challenge to algorithm developers. This paper aims to aid in the development of rapid makespan prediction algorithms by providing a highly flexible evaluation framework for heterogeneous systems consisting of CPUs, GPUs and FPGAs, which is capable of collecting real-world makespan results based on abstract task graph descriptions. We analyze to what extent actual makespans can be predicted by existing analytical approaches. Furthermore, we present common challenges that arise from high-level characteristics such as data transfer overhead and device congestion in heterogeneous systems.

Paper Structure

This paper contains 18 sections, 5 figures.

Figures (5)

  • Figure 1: The process flow of the evaluation framework. The framework enables the comparison of a makespan prediction based on an annotated task graph with the real makespan in an equivalent heterogeneous system for arbitrary task graphs.
  • Figure 2: The architecture of an FPGA kernel. At its core, the computation is divided into multiple subtasks according to its streamability. Data exchange with the CPU and GPU is done via RAM, whereas communication with other FPGA kernels is based on AXI streams with FIFO buffers.
  • Figure 3: Three different cases that complicate streaming between tasks. An interposed FPGA node delays the stream, whereas an interposed CPU node makes streaming impossible. An otherwise unrelated incoming CPU node may or may not hinder the streaming depending on its finish time.
  • Figure 4: Predicted and actual execution times for a pure CPU mapping and a mapping derived through simulated annealing for ten task graphs of size $20$.
  • Figure 5: Exemplary mapped task graph with predicted and actual execution times, where the decision whether to stream between two FPGA kernels differs. While the prediction model decides to stream between the tasks, the evaluation framework executes them separately. In each node, the complexity $c$, parallelizability $p$ and streamability $s$ are depicted together with the time window of the execution. Differences in the start nodes result from an initialization overhead of the real system.
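
To make the idea of an analytical makespan prediction on a mapped task graph concrete, the following is a minimal sketch and not the model evaluated in the paper. It assumes hypothetical inputs: a per-device execution time for each task (`exec_time`), a predecessor map (`preds`), a task-to-device `mapping`, and a single constant `transfer_cost` added whenever an edge crosses devices. It models device congestion by letting each device run one task at a time, and it ignores FPGA streaming as well as the $c$, $p$ and $s$ annotations.

```python
from graphlib import TopologicalSorter


def predict_makespan(exec_time, preds, mapping, transfer_cost):
    """Estimate the makespan of a mapped task graph (illustrative model only).

    exec_time[task][device] -> assumed execution time of task on device
    preds[task]             -> list of predecessor tasks
    mapping[task]           -> device the task is mapped to
    transfer_cost           -> assumed constant delay for cross-device edges
    """
    device_free = {}  # time at which each device becomes idle
    finish = {}       # predicted finish time of each task
    for task in TopologicalSorter(preds).static_order():
        dev = mapping[task]
        ready = 0.0
        for p in preds.get(task, ()):
            # Data from another device arrives only after a transfer delay.
            delay = transfer_cost if mapping[p] != dev else 0.0
            ready = max(ready, finish[p] + delay)
        # A task also waits until its device is no longer busy.
        start = max(ready, device_free.get(dev, 0.0))
        finish[task] = start + exec_time[task][dev]
        device_free[dev] = finish[task]
    return max(finish.values())


# Hypothetical four-task diamond graph: a -> {b, c} -> d.
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
exec_time = {
    "a": {"cpu": 2.0, "gpu": 1.0},
    "b": {"cpu": 3.0, "gpu": 2.0},
    "c": {"cpu": 4.0, "gpu": 1.0},
    "d": {"cpu": 1.0, "gpu": 1.0},
}
all_cpu = {t: "cpu" for t in preds}
mixed = {"a": "cpu", "b": "cpu", "c": "gpu", "d": "cpu"}

cpu_only = predict_makespan(exec_time, preds, all_cpu, 0.5)  # 10.0
offload = predict_makespan(exec_time, preds, mixed, 0.5)     # 6.0
```

Because such a function runs in time linear in the size of the graph, it can be called inside a search loop (for example simulated annealing over mappings, as in Figure 4), which is what makes rapid predictions attractive. The gap between this kind of estimate and measured makespans is exactly what the evaluation framework is meant to expose.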