Static task mapping for heterogeneous systems based on series-parallel decompositions

Martin Wilhelm, Thilo Pionteck

TL;DR

This work tackles static task mapping in highly heterogeneous systems with many tasks and dependencies by introducing a decomposition-based principle that leverages series-parallel graph structures and a fast model-based cost function. It constructs a forest of series-parallel decomposition trees for general DAGs and provides practical FirstFit heuristics (γ-threshold and basic FirstFit) to guide mappings efficiently. Across synthetic and real-world benchmarks, the approach yields substantially higher makespan improvements than HEFT and PEFT while remaining orders of magnitude faster than mappers based on genetic algorithms or integer linear programs, with the largest benefit when the task graph is series-parallel or nearly so. The method enables scalable, high-quality static mappings for complex heterogeneous platforms, including streaming data paths on FPGAs, making it suitable for applications with many tasks and dependencies.

Abstract

Modern heterogeneous systems consist of many different processing units, such as CPUs, GPUs, FPGAs and AI units. A central problem in the design of applications in this environment is to find a beneficial mapping of tasks to processing units. While there are various approaches to task mapping, few can deal with high heterogeneity or applications with a high number of tasks and many dependencies. In addition, streaming aspects of FPGAs are generally not considered. We present a new general task mapping principle based on graph decompositions and model-based evaluation that can find beneficial mappings regardless of the complexity of the scenario. We apply this principle to create a high-quality and reasonably efficient task mapping algorithm using series-parallel decompositions. For this, we present a new algorithm to compute a forest of series-parallel decomposition trees for general DAGs. We compare our decomposition-based mapping algorithm with three mixed-integer linear programs, one genetic algorithm and two variations of the Heterogeneous Earliest Finish Time (HEFT) algorithm. We show that our approach can generate mappings that lead to substantially higher makespan improvements than the HEFT variations in complex environments while being orders of magnitude faster than a mapper based on genetic algorithms or integer linear programs.
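
To make the notions of a mapping and its makespan concrete, the following is a minimal sketch and not the cost model used in the paper. It assumes a hypothetical function `makespan`, ignores communication costs and FPGA streaming, and simply executes tasks in one topological order, letting each processing unit run one task at a time. All task names, unit names and execution times are made up for illustration.

```python
from graphlib import TopologicalSorter


def makespan(deps, exec_time, mapping):
    """Estimate the makespan of a static mapping (simplified, assumed model).

    deps:      task -> set of predecessor tasks (a DAG)
    exec_time: (task, unit) -> execution time of the task on that unit
    mapping:   task -> processing unit the task is mapped to
    """
    finish = {}      # finish time of each task
    unit_free = {}   # time at which each unit becomes idle again
    for task in TopologicalSorter(deps).static_order():
        unit = mapping[task]
        # A task can start once all predecessors are done and its unit is idle.
        ready = max((finish[p] for p in deps.get(task, ())), default=0.0)
        start = max(ready, unit_free.get(unit, 0.0))
        finish[task] = start + exec_time[task, unit]
        unit_free[unit] = finish[task]
    return max(finish.values(), default=0.0)


# Diamond-shaped task graph: a -> {b, c} -> d (illustrative data only).
deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
exec_time = {
    ("a", "cpu"): 2, ("a", "gpu"): 4,
    ("b", "cpu"): 3, ("b", "gpu"): 1,
    ("c", "cpu"): 3, ("c", "gpu"): 5,
    ("d", "cpu"): 2, ("d", "gpu"): 1,
}

all_cpu = {t: "cpu" for t in deps}
b_on_gpu = dict(all_cpu, b="gpu")

baseline = makespan(deps, exec_time, all_cpu)
improved = makespan(deps, exec_time, b_on_gpu)
```

In this toy example, moving task `b` to the GPU lets `b` and `c` run concurrently, which shortens the makespan relative to the all-CPU baseline. A mapper in the spirit of the abstract would search over such reassignments and keep those that the cost function rates as improvements.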

Paper Structure

This paper contains 16 sections, 1 equation, 7 figures, 1 table, 1 algorithm.

Figures (7)

  • Figure 1: A directed series-parallel graph and its decomposition tree. A round node indicates a parallel operation, whereas a rectangular node indicates a series operation. Each leaf represents an edge in the original graph.
  • Figure 2: An illustration of the cutting step of Alg. 1 and the resulting decomposition forest for a non-series-parallel graph. Blue dotted edges indicate a currently active grow_series call, whereas red dashed ellipses indicate an active grow_parallel call. Stars in a series operation indicate that the final end node has not yet been found. The decomposition forest on the right side results from cutting the subgraph $1-5$.
  • Figure 3: Comparison between single node and series-parallel decomposition mapping and three integer linear programs for random series-parallel graphs. Data points are generated for each graph size between 5 and 30 tasks for all algorithms except for ZhouLiu. Due to excessive execution times, data points for ZhouLiu are only available for 5, 10, 15 and 20 tasks.
  • Figure 4: Comparison of the relative improvements and the execution times of the list-based scheduling algorithms HEFT and PEFT and the two decomposition strategies SingleNode and SeriesParallel with and without the FirstFit heuristic. Data points are generated for 5 to 200 tasks in steps of 5 tasks. The execution time is displayed using a logarithmic scale. The execution times for HEFT and PEFT are below 10 µs and therefore not displayed.
  • Figure 5: Comparison of the relative improvement and the execution times of the genetic algorithm (NSGAII) and the two decomposition strategies with the FirstFit heuristic. Data points are generated for 5 to 100 tasks in steps of 5 tasks. The execution time is displayed using a logarithmic scale.
  • ...and 2 more figures
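
The decomposition tree described in the Figure 1 caption (inner nodes are series or parallel operations, leaves are edges of the original graph) can be represented with a small recursive data structure. The sketch below is an illustration under assumptions: the class names `Leaf`, `Series` and `Parallel` and the helpers `terminals` and `edges` are hypothetical and do not come from the paper, and the example graph is made up.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Edge = Tuple[str, str]  # (start node, end node) of an edge in the task graph


@dataclass
class Leaf:
    edge: Edge


@dataclass
class Series:
    children: List["Node"]


@dataclass
class Parallel:
    children: List["Node"]


Node = Union[Leaf, Series, Parallel]


def terminals(node: Node) -> Edge:
    """Return (source, sink) of the subgraph a tree node represents."""
    if isinstance(node, Leaf):
        return node.edge
    ends = [terminals(child) for child in node.children]
    if isinstance(node, Series):
        # Series: each child must start where the previous child ends.
        for (_, sink), (source, _) in zip(ends, ends[1:]):
            if sink != source:
                raise ValueError("series children are not chained")
        return ends[0][0], ends[-1][1]
    # Parallel: all children must share the same source and sink.
    if len(set(ends)) != 1:
        raise ValueError("parallel children have different terminals")
    return ends[0]


def edges(node: Node) -> List[Edge]:
    """Collect the original graph edges stored in the leaves."""
    if isinstance(node, Leaf):
        return [node.edge]
    return [e for child in node.children for e in edges(child)]


# Graph with the path s -> a -> t in parallel with the direct edge s -> t.
tree = Parallel([
    Series([Leaf(("s", "a")), Leaf(("a", "t"))]),
    Leaf(("s", "t")),
])
source_sink = terminals(tree)
all_edges = edges(tree)
```

Here `terminals(tree)` yields the source and sink of the whole graph, and `edges(tree)` recovers its edge list. For a DAG that is not series-parallel, a decomposition forest as in Figure 2 would simply be a list of such trees.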