Mapple: A Domain-Specific Language for Mapping Distributed Programs

Anjiang Wei, Rohan Yadav, Hang Song, Wonchan Lee, Ke Wang, Alex Aiken

TL;DR

Mapple introduces a high-level DSL for mapping distributed task-based programs to processor hierarchies, addressing the core challenge of dimensionality mismatches between task and processor spaces. The approach centers on transformation primitives, especially decompose, to minimize inter-processor communication, and Mapple mappers are translated to the low-level C++ mapping interface of the Legion runtime. Evaluation across nine applications shows that Mapple achieves a 14× reduction in mapper code and up to 1.34× speedup over expert-written C++ mappers, while the decompose primitive yields up to 1.83× improvement over existing dimensionality-resolution heuristics. These results indicate that Mapple can simplify high-performance mapper development and deliver competitive performance by balancing workload and reducing data movement in distributed environments.

Abstract

Optimizing parallel programs for distributed systems is a complex task, often requiring significant code modifications. Task-based programming systems improve modularity by separating performance decisions from application logic, but their mapping interfaces are low-level. We introduce Mapple, a high-level, declarative programming interface for mapping distributed applications. Mapple provides transformation primitives to resolve dimensionality mismatches between task and processor spaces, including a key primitive, decompose, that helps minimize communication volume. We implement Mapple on top of the Legion runtime by translating Mapple mappers into its low-level C++ interface. Across nine applications, including six matrix multiplication algorithms and three scientific computing workloads, Mapple reduces mapper code size by 14x and enables performance improvements of up to 1.34x over expert-written C++ mappers. In addition, the decompose primitive achieves up to 1.83x improvement over existing dimensionality-resolution heuristics.

Paper Structure

This paper contains 33 sections, 1 theorem, 14 equations, 18 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Given a list of $n$ positive numbers $a_1, a_2, \ldots, a_n$, the following inequality holds: $\frac{a_1 + a_2 + \cdots + a_n}{n} \geq \sqrt[n]{a_1 a_2 \cdots a_n}$, with equality if and only if $a_1 = a_2 = \cdots = a_n$.
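
The equality condition in Theorem 1 is that of the arithmetic mean-geometric mean (AM-GM) inequality. A minimal numeric check of that inequality is sketched below. The helper names `arithmetic_mean` and `geometric_mean` are hypothetical and chosen only for illustration; this summary does not show how the paper applies the theorem, so the sketch checks only the inequality itself.

```python
import math


def arithmetic_mean(values):
    """Arithmetic mean of a non-empty list of positive numbers."""
    return sum(values) / len(values)


def geometric_mean(values):
    """Geometric mean, computed through logarithms to avoid overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Unequal values: the arithmetic mean is strictly larger.
unequal = [1.0, 4.0, 16.0]
assert arithmetic_mean(unequal) > geometric_mean(unequal)

# Equal values: the two means coincide (up to floating-point rounding).
equal = [3.0, 3.0, 3.0]
assert math.isclose(arithmetic_mean(equal), geometric_mean(equal))
```

One consequence of the inequality is that, among positive terms with a fixed product, the sum is smallest when all terms are equal.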

Figures (18)

  • Figure 1: Comparison between a Mapple mapper and its partial C++ counterpart. The Mapple mapper uses a high-level, declarative design that abstracts away the complexity of low-level C++ implementations while still supporting performance optimization. The boxed block2d function is realized through two separate APIs in the C++ mapper, illustrating the conciseness of Mapple.
  • Figure 2: Three existing interface designs for mapping task space to processor space: the enumeration-based, keyword-based, and programmatic approaches.
  • Figure 3: Block mapping from the task space $(6, 6)$ to the processor space $(2, 2)$, a machine with 2 nodes and 2 GPUs per node. A node index and a GPU index within the node name a specific GPU processor. The shaded index point $(2, 3)$ is mapped to node 0 and GPU 1.
  • Figure 4: A custom cyclic distribution. The merge primitive transforms the 2D processor space into a 1D space. The mapping function linearizes each 2D index point and applies a round-robin distribution over the resulting 1D processor space.
  • Figure 5: A mapper illustrating the dimensionality mismatch between task and processor spaces in Solomonik's algorithm on a machine with 2 nodes and 4 GPUs per node. The original 2D processor space is transformed to 6D via the split primitive, shown as two 3D spaces.
  • ...and 13 more figures
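
The block and cyclic mappings described in the captions of Figures 3 and 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Mapple syntax or the Legion API: the function names `block_map` and `cyclic_map` are hypothetical, and the row-major linearization order is an assumption, since the caption of Figure 4 does not specify it.

```python
def block_map(point, task_shape, proc_shape):
    """Assign each task index to a processor index by contiguous blocks.

    Along each dimension, a task extent t is cut into p equal blocks,
    so task index i lands in block i * p // t.
    """
    return tuple(i * p // t for i, t, p in zip(point, task_shape, proc_shape))


def cyclic_map(point, task_shape, proc_shape):
    """Round-robin mapping over the processor space flattened to 1D.

    Assumes row-major linearization of both the task point and the
    processor space (the flattening plays the role of a merge step).
    """
    # Linearize the task index point in row-major order.
    linear = 0
    for i, t in zip(point, task_shape):
        linear = linear * t + i

    # Total number of processors in the flattened 1D space.
    num_procs = 1
    for p in proc_shape:
        num_procs *= p

    # Round-robin over the 1D processor space.
    flat = linear % num_procs

    # Recover the (node, gpu) coordinates from the 1D processor index.
    coords = []
    for p in reversed(proc_shape):
        coords.append(flat % p)
        flat //= p
    return tuple(reversed(coords))


# Figure 3 example: task space (6, 6), 2 nodes with 2 GPUs each.
print(block_map((2, 3), (6, 6), (2, 2)))  # (0, 1): node 0, GPU 1
```

Under the block mapping, neighboring index points share a processor, whereas the cyclic sketch deals consecutive linearized points to different processors.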

Theorems & Definitions (1)

  • Theorem