Slicing Is All You Need: Towards A Universal One-Sided Algorithm for Distributed Matrix Multiplication

Benjamin Brock, Renato Golin

TL;DR

This work addresses the fragmentation of distributed matrix multiplication algorithms by proposing a universal one-sided algorithm that supports all combinations of partitionings and replication factors. The method uses slicing (index arithmetic) to enumerate the local tile multiplies each process must perform; this list can be executed directly, or reordered and lowered to an optimized IR to maximize overlap. The algorithm is implemented in a C++ PGAS framework with direct GPU-to-GPU communication. It is competitive with PyTorch DTensor on matrix multiplications sized like the MLP layers of a GPT-like transformer, across a broad set of partitionings and replication factors, which supports the practicality of a single algorithm for diverse distributions. Because unsupported partitionings no longer force operands to be redistributed, the approach lowers implementation burden and opens a wider design space for AI training and large-scale scientific computing.

Abstract

Many important applications across science, data analytics, and AI workloads depend on distributed matrix multiplication. Prior work has developed a large array of algorithms suitable for different problem sizes and partitionings including 1D, 2D, 1.5D, and 2.5D algorithms. A limitation of current work is that existing algorithms are limited to a subset of partitionings. Multiple algorithm implementations are required to support the full space of possible partitionings. If no algorithm implementation is available for a particular set of partitionings, one or more operands must be redistributed, increasing communication costs. This paper presents a universal one-sided algorithm for distributed matrix multiplication that supports all combinations of partitionings and replication factors. Our algorithm uses slicing (index arithmetic) to compute the sets of overlapping tiles that must be multiplied together. This list of local matrix multiplies can then either be executed directly, or reordered and lowered to an optimized IR to maximize overlap. We implement our algorithm using a high-level C++-based PGAS programming framework that performs direct GPU-to-GPU communication using intra-node interconnects. We evaluate performance for a wide variety of partitionings and replication factors, finding that our work is competitive with PyTorch DTensor, a highly optimized distributed tensor library targeting AI models.
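To make the slicing step concrete, here is a minimal sketch of the index arithmetic along a single dimension. It assumes regular tile sizes and half-open index ranges; the helper names `tile_bounds` and `overlaps` are hypothetical and do not come from the paper's framework. The example intersects two misaligned partitionings of the shared $k$ dimension, one for the columns of $A$ and one for the rows of $B$.

```python
def tile_bounds(extent, tile):
    """Half-open [start, stop) ranges that cover range(extent) in steps of `tile`."""
    return [(s, min(s + tile, extent)) for s in range(0, extent, tile)]


def overlaps(query, tiles):
    """Return (tile_index, (start, stop)) for every tile that intersects `query`."""
    q0, q1 = query
    found = []
    for idx, (t0, t1) in enumerate(tiles):
        lo, hi = max(q0, t0), min(q1, t1)
        if lo < hi:  # non-empty intersection
            found.append((idx, (lo, hi)))
    return found


# Assumed toy sizes: k = 10, A's columns tiled by 4, B's rows tiled by 3.
a_k = tile_bounds(10, 4)
b_k = tile_bounds(10, 3)

# Every (A tile, B tile) pair that shares part of k, with the shared range.
pieces = [(ia, ib, rng)
          for ia, a_rng in enumerate(a_k)
          for ib, rng in overlaps(a_rng, b_k)]

for ia, ib, (lo, hi) in pieces:
    print(f"A k-tile {ia} x B k-tile {ib}: k in [{lo}, {hi})")
```

Each resulting piece is a range of $k$ over which a slice of one $A$ tile can be multiplied with a slice of one $B$ tile, even though the two tilings do not line up.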

Paper Structure

This paper contains 19 sections, 3 figures, 2 tables, 2 algorithms.

Figures (3)

  • Figure 1: Our universal one-sided algorithm for distributed matrix multiplication takes in three matrices $C = AB$ with any partitioning. This diagram shows Stationary C data movement with intentionally misaligned tiles to illustrate the algorithm's generality. ① We perform slicing to identify tiles of A (yellow region) and B (grey region) that overlap with our process' stationary tile of C (blue region). This produces a list of local matrix multiply operations ② that must be performed. Note that these tiles are not required to be aligned. We then compute the result either by ③ directly executing this list or by ④ reordering and lowering to an optimized IR to maximize overlap.
  • Figure 2: Experiments on an Intel PVC system comparing our methods versus DTensor for matrix multiplications with dimensions reflective of the MLP layer in a GPT-like transformer. (MLP-1: $m = \textnormal{batch size}$, $n = 48\textnormal{K}$, $k = 12\textnormal{K}$, MLP-2: $m = \textnormal{batch size}$, $n = 12\textnormal{K}$, $k = 48\textnormal{K}$.) Replication factors are plotted above each result. For MLP-2, where we used mixed replication factors, the replication factor for $A$ and $B$ is shown before the dash and replication factor for $C$ is shown after the dash. S-C refers to Stationary C data movement, while S-B refers to Stationary B data movement.
  • Figure 3: Experiments on an Nvidia H100 system comparing our methods versus DTensor for matrix multiplications with dimensions reflective of the MLP layer in a GPT-like transformer. (MLP-1: $m = \textnormal{batch size}$, $n = 48\textnormal{K}$, $k = 12\textnormal{K}$, MLP-2: $m = \textnormal{batch size}$, $n = 12\textnormal{K}$, $k = 48\textnormal{K}$.) Replication factors are plotted above each result. For MLP-2, where we used mixed replication factors, the replication factor for $A$ and $B$ is shown before the dash and replication factor for $C$ is shown after the dash. S-C refers to Stationary C data movement, while S-B refers to Stationary B data movement.
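The Stationary C flow described in the Figure 1 caption (slice, build a list of local multiplies, execute directly) can be sketched in a single process as follows. This is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, tiles are regular, replication is ignored, and $A$ and $B$ are held as global Python lists, so the remote tile fetches that a one-sided implementation would issue are replaced by plain indexing.

```python
def tile_bounds(extent, tile):
    """Half-open [start, stop) ranges covering range(extent) in steps of `tile`."""
    return [(s, min(s + tile, extent)) for s in range(0, extent, tile)]


def intersect(a, b):
    """Intersection of two half-open ranges, or None if they do not overlap."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None


def stationary_c_ops(c_rows, c_cols, a_rows, a_cols, b_rows, b_cols):
    """List the local multiplies that contribute to one stationary tile of C.

    Each op is (a_tile, b_tile, rows, ks, cols), where the last three entries
    are global half-open index ranges for m, k, and n respectively.
    """
    ops = []
    for ai, ar in enumerate(a_rows):
        rows = intersect(ar, c_rows)
        if rows is None:
            continue
        for aj, ak in enumerate(a_cols):
            for bi, bk in enumerate(b_rows):
                ks = intersect(ak, bk)  # A and B may tile k differently
                if ks is None:
                    continue
                for bj, bc in enumerate(b_cols):
                    cols = intersect(bc, c_cols)
                    if cols is not None:
                        ops.append(((ai, aj), (bi, bj), rows, ks, cols))
    return ops


def execute(ops, A, B, c_rows, c_cols):
    """Directly execute the op list, accumulating into the local C tile."""
    C = [[0] * (c_cols[1] - c_cols[0]) for _ in range(c_rows[1] - c_rows[0])]
    for _, _, (r0, r1), (k0, k1), (j0, j1) in ops:
        for i in range(r0, r1):
            for j in range(j0, j1):
                C[i - c_rows[0]][j - c_cols[0]] += sum(
                    A[i][k] * B[k][j] for k in range(k0, k1))
    return C


# Assumed toy problem: m=5, k=7, n=6 with deliberately misaligned tilings.
m, k, n = 5, 7, 6
A = [[i * k + j for j in range(k)] for i in range(m)]
B = [[i - 2 * j for j in range(n)] for i in range(k)]
c_rows, c_cols = (1, 4), (2, 6)  # this process's stationary tile of C

ops = stationary_c_ops(c_rows, c_cols,
                       tile_bounds(m, 2), tile_bounds(k, 3),   # A tiled 2 x 3
                       tile_bounds(k, 4), tile_bounds(n, 5))   # B tiled 4 x 5
C_tile = execute(ops, A, B, c_rows, c_cols)

# Reference: the same block of C computed without any tiling.
reference = [[sum(A[i][x] * B[x][j] for x in range(k)) for j in range(*c_cols)]
             for i in range(*c_rows)]
print(len(ops), C_tile == reference)
```

Because the op list is plain data, it could also be reordered before execution, which is where the caption's alternative path of lowering to an optimized IR would begin; that step is not sketched here.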