Shared-Memory Hierarchical Process Mapping

Christian Schulz, Henning Woydt

TL;DR

GPMP asks for a mapping of $n$ tasks onto $k$ processing elements of a hierarchical supercomputer topology that minimizes total communication cost while balancing the workload; the problem is NP-hard. The authors introduce SharedMap, a parallel shared-memory hierarchical multisection algorithm that partitions the communication graph along the system hierarchy and then uses an identity mapping, enabling scalable, high-quality mappings. Key contributions include four threading strategies (Layer, Priority Queue, Non-Blocking Layer, Naive) and an adaptive imbalance scheme $\epsilon'$ that preserves $\epsilon$-balance across hierarchy levels. In experiments on diverse benchmarks, SharedMap achieves the best-known solution quality on $95\%$ of instances and outperforms state-of-the-art parallel methods in most cases. The approach is practical, improves mapping quality at competitive runtimes, and opens avenues for a GPU extension and for dynamic, adaptive mapping of evolving workloads.

Abstract

Modern large-scale scientific applications consist of thousands to millions of individual tasks. These tasks involve not only computation but also communication with one another. Typically, the communication pattern between tasks is sparse and can be determined in advance. Such applications are executed on supercomputers, which are often organized in a hierarchical hardware topology, consisting of islands, racks, nodes, and processors, where processing elements reside. To ensure efficient workload distribution, tasks must be allocated to processing elements in a way that ensures balanced utilization. However, this approach optimizes only the workload, not the communication cost of the application. It is straightforward to see that placing groups of tasks that frequently exchange large amounts of data on processing elements located near each other is beneficial. The problem of mapping tasks to processing elements considering optimization goals is called process mapping. In this work, we focus on minimizing communication cost while evenly distributing work. We present the first shared-memory algorithm that utilizes hierarchical multisection to partition the communication model across processing elements. Our parallel approach achieves the best solution on 95 percent of instances while also being marginally faster than the next best algorithm. Even in a serial setting, it delivers the best solution quality while also outperforming previous serial algorithms in speed.
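The control flow of hierarchical multisection followed by an identity mapping can be illustrated with a short sketch. This is not the SharedMap implementation: the function names are hypothetical, and `chunk_partition` is only a stand-in that splits a task list into contiguous, near-equal chunks and ignores communication edges, where a real graph partitioner would be used. The hierarchy is assumed to be given bottom-up, as in $H=4:2:3$.

```python
from typing import Callable, Dict, List, Sequence


def chunk_partition(vertices: Sequence[int], parts: int) -> List[List[int]]:
    """Placeholder partitioner (assumption): contiguous near-equal chunks.

    It ignores communication edges entirely; a real graph partitioner
    would be plugged in here instead.
    """
    base, extra = divmod(len(vertices), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(list(vertices[start:start + size]))
        start += size
    return chunks


def hierarchical_multisection(
    vertices: Sequence[int],
    hierarchy: Sequence[int],
    partition: Callable[[Sequence[int], int], List[List[int]]] = chunk_partition,
) -> Dict[int, int]:
    """Split the task set level by level, then map block i to PE i.

    `hierarchy` is given bottom-up (a_1, ..., a_l), so the top level
    a_l is processed first.
    """
    blocks: List[List[int]] = [list(vertices)]
    for a in reversed(hierarchy):
        # Every current block is split into `a` sub-blocks; sub-blocks of
        # the same parent stay adjacent in the resulting list.
        blocks = [sub for block in blocks for sub in partition(block, a)]

    # Identity mapping: the i-th final block goes to processing element i.
    mapping: Dict[int, int] = {}
    for pe, block in enumerate(blocks):
        for v in block:
            mapping[v] = pe
    return mapping


if __name__ == "__main__":
    mapping = hierarchical_multisection(range(48), [4, 2, 3])
    print(len(set(mapping.values())))  # number of processing elements used
```

Because sibling blocks stay adjacent in the block list, consecutive PE indices end up in the same hardware group, which is why the final mapping step can be the identity.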

Paper Structure

This paper contains 19 sections, 1 theorem, 2 equations, 6 figures, 1 table, 3 algorithms.

Key Result

Lemma 5.1

Let $G=(V,E)$ be the graph to be partitioned, $\epsilon$ the allowed imbalance, and $k = \prod_{i=1}^{\ell} a_i$ the number of partitions, with the $a_i$'s describing the hierarchy. Let $G'=(V',E')$ be the subgraph to be partitioned, $d$ the depth in the hierarchy (where the original graph $G_{\mathcal{C}}$ [...] as the adaptive imbalance parameter to partition $G'$ into $a_d$ partitions ensures that the final [...]

Figures (6)

  • Figure 1: The hierarchical multisection approach with hierarchy $H=4:2:3$ and $D=1:10:100$. On the left-hand side, the partitioning of $G_{\mathcal{C}}$ into $k = 4\cdot 2 \cdot 3 = 24$ blocks is shown. First $G_{\mathcal{C}}$ is partitioned into three blocks ($G_{1}^2$, $G_{2}^2$, $G_{3}^2$), each of the blocks is further partitioned into two blocks ($G_{1}^1$ to $G_{6}^1$), and finally, each of these is partitioned into four blocks ($G_1$ to $G_{24}$). On the right side, the resulting partitioning and the corresponding communication graph are depicted. Solid lines indicate a communication factor of 1 between communicating tasks, dashed lines indicate a factor of 10, and the dotted lines indicate a factor of 100. For example, if a task in $G_1$ communicates with a task in $G_4$, the cost is scaled by 1. If it communicates with a task in $G_6$ the cost is scaled by 10 and if it communicates with a task in $G_{14}$ the cost is scaled by 100.
  • Figure 5: Solution quality (left) and speedup over Strong-1 (right) for the 1- and 80-thread Non-Blocking Layer Fast/Eco/Strong configurations.
  • Figure 6: Runtime comparison for 80 threads between Naive, Layer, Queue and Non-Blocking Layer on small graphs (left) and large graphs (right).
  • Figure 7: Comparing scalability of Non-Blocking Layer with the Strong configuration on small graphs (left) and large graphs (right).
  • Figure 8: Solution quality (left) and speedup over SharedMap-S (right) for the various parallel implementations. All algorithms are run with 80 threads.
  • ...and 1 more figure
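The cost scaling described in the Figure 1 caption can be reproduced with a small sketch. It assumes processing elements are numbered consecutively from 0 (so $G_1$ is index 0), that $H$ and $D$ are given bottom-up, and that a task pair on the same processing element has factor 0; the function names and the edge-list format are hypothetical and not taken from the paper.

```python
from typing import Dict, Iterable, Sequence, Tuple


def hierarchy_distance(x: int, y: int,
                       hierarchy: Sequence[int],
                       distances: Sequence[int]) -> int:
    """Scaling factor between 0-indexed processing elements x and y.

    Walks the hierarchy bottom-up and returns the distance of the lowest
    level at which both elements fall into the same group.
    """
    if x == y:
        return 0  # assumption: no cost on the same processing element
    group = 1
    for a, d in zip(hierarchy, distances):
        group *= a  # number of PEs contained in one group at this level
        if x // group == y // group:
            return d
    raise ValueError("processing element index outside the hierarchy")


def communication_cost(edges: Iterable[Tuple[int, int, float]],
                       mapping: Dict[int, int],
                       hierarchy: Sequence[int],
                       distances: Sequence[int]) -> float:
    """Sum of edge weights scaled by the distance between the mapped PEs."""
    return sum(
        w * hierarchy_distance(mapping[u], mapping[v], hierarchy, distances)
        for u, v, w in edges
    )


if __name__ == "__main__":
    H, D = [4, 2, 3], [1, 10, 100]
    # G_1 vs G_4, G_6, G_14 from the caption (0-indexed: 0 vs 3, 5, 13)
    print(hierarchy_distance(0, 3, H, D))   # 1
    print(hierarchy_distance(0, 5, H, D))   # 10
    print(hierarchy_distance(0, 13, H, D))  # 100
```

The three printed factors match the caption's examples: $G_1$ and $G_4$ share the lowest-level group, $G_6$ lies in the neighbouring group of four, and $G_{14}$ lies in a different top-level block.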

Theorems & Definitions (1)

  • Lemma 5.1: Adaptive Imbalance