Composing Distributed Computations Through Task and Kernel Fusion

Rohan Yadav, Shiv Sundram, Wonchan Lee, Michael Garland, Michael Bauer, Alex Aiken, Fredrik Kjolstad

TL;DR

Diffuse tackles the challenge of efficiently composing distributed, task-based computations by introducing a scale-free intermediate representation that enables scalable, alias-aware fusion analyses across library boundaries. It pairs distributed task fusion with an MLIR-based JIT that fuses the kernels inside fused tasks, yielding significant speedups on unmodified cuNumeric and Legate Sparse workloads and matching or surpassing hand-tuned, MPI-based baselines. The approach is domain-agnostic, enabling cross-library optimization, eliminating distributed temporaries through temporary-store elimination, and amortizing analysis cost through memoization. Practically, Diffuse achieves a 1.86x geometric-mean speedup on up to 128 GPUs and up to a 1.23x speedup over hand-optimized code, while keeping compilation overhead modest, suggesting broad applicability to scalable, distributed scientific computing.
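
To make the kernel-fusion and temporary-elimination claims concrete, the following sketch contrasts an unfused pair of element-wise operations with the single pass a fused kernel performs. It is purely illustrative: the function names are hypothetical, and plain NumPy stands in for a distributed array library; Diffuse derives the fused form automatically rather than requiring it to be written by hand.

```python
import numpy as np  # stand-in for a cuNumeric-style distributed array library

def unfused(a, b, d):
    # Two separate element-wise operations: `t` is a materialized
    # temporary that, in the distributed setting, must be allocated
    # and partitioned across the machine.
    t = a + b
    return t * d

def fused(a, b, d):
    # The effect of kernel fusion plus temporary-store elimination:
    # a single pass over the data with no materialized temporary.
    # (Written as an explicit loop only to make the single pass visible.)
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        out[i] = (a[i] + b[i]) * d[i]
    return out
```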

Abstract

We introduce Diffuse, a system that dynamically performs task and kernel fusion in distributed, task-based runtime systems. The key component of Diffuse is an intermediate representation of distributed computation that enables the necessary analyses for the fusion of distributed tasks to be performed in a scalable manner. We pair task fusion with a JIT compiler to fuse together the kernels within fused tasks. We show empirically that Diffuse's intermediate representation is general enough to be a target for two real-world, task-based libraries (cuNumeric and Legate Sparse), letting Diffuse find optimization opportunities across function and library boundaries. Diffuse accelerates unmodified applications developed by composing task-based libraries by 1.86x on average (geo-mean), and by between 0.93x and 10.7x on up to 128 GPUs. Diffuse also finds optimization opportunities missed by the original application developers, enabling high-level Python programs to match or exceed the performance of an explicitly parallel MPI library.
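
As a concrete example of the kind of program Diffuse targets, consider the 5-point stencil from Figure 1, written against cuNumeric's NumPy-compatible API. This is a minimal sketch assuming standard cuNumeric slicing and arithmetic: without fusion, the stencil expression launches several distributed tasks and materializes distributed temporaries, all of which Diffuse can collapse into one fused task with one JIT-compiled kernel per GPU.

```python
import cunumeric as np  # drop-in NumPy replacement; each operation launches a distributed task

def stencil_step(grid):
    # 5-point stencil (cf. Figure 1). Each slice is a view of the grid,
    # and each arithmetic operation below is a separate element-wise task.
    center = grid[1:-1, 1:-1]
    north = grid[0:-2, 1:-1]
    south = grid[2:, 1:-1]
    west = grid[1:-1, 0:-2]
    east = grid[1:-1, 2:]
    # Unfused, this is four adds and a scalar multiply, each producing a
    # distributed temporary; Diffuse buffers these tasks and fuses them
    # into a single task and kernel.
    return 0.2 * (center + north + south + west + east)
```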

Paper Structure

This paper contains 25 sections, 1 theorem, 1 equation, and 13 figures.

Key Result

Theorem 1

Given a window of tasks $[T_1, \ldots, T_n]$, our task fusion algorithm identifies a prefix $[T_1, \ldots, T_f]$ and produces a fused task $F$ such that
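
The algorithm Theorem 1 characterizes scans a buffered window of tasks and grows the longest fusable prefix. Below is a minimal sketch of that greedy scan, not Diffuse's implementation: `fusable` stands in for the paper's fusion constraints (see Figure 5), and the `Task`/`FusedTask` types are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Task:
    name: str  # launch domain, read/write sets, etc. elided

@dataclass
class FusedTask:
    parts: List[Task]

def fuse_window(window: List[Task],
                fusable: Callable[[Task, Task], bool]) -> Tuple[FusedTask, List[Task]]:
    # Greedily grow the fusable prefix [T_1, ..., T_f]: task T_i joins
    # the prefix only if it satisfies the fusion constraints against
    # every task already in it; the first violation ends the prefix.
    f = 0  # index of the last task in the current prefix
    for i in range(1, len(window)):
        if all(fusable(window[j], window[i]) for j in range(i)):
            f = i
        else:
            break
    # Produce the fused task F from the prefix; the rest of the window
    # is left for the next round of fusion.
    return FusedTask(window[: f + 1]), window[f + 1:]
```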

Figures (13)

  • Figure 1: Execution example of Diffuse on a distributed, multi-GPU cuPyNumeric 5-point stencil application.
  • Figure 2: Diffuse's IR exposes a distributed data model and a model for distributed computation on distributed data.
  • Figure 3: Examples of Tiling partitions. A partition maps points in the denoted domain to sub-stores; each color refers to the sub-store associated with each point in the domain.
  • Figure 4: Visualization of dependence maps $\mathcal{D}(T_1, T_2)$.
  • Figure 5: Fusion constraints employed by Diffuse to identify potential communication between index tasks.
  • ...and 8 more figures

Theorems & Definitions (5)

  • Definition 1
  • Definition 2
  • Definition 3
  • Theorem 1
  • Definition 4