Slicing Is All You Need: Towards A Universal One-Sided Algorithm for Distributed Matrix Multiplication
Benjamin Brock, Renato Golin
TL;DR
This work addresses the fragmentation of distributed matrix multiplication algorithms by proposing a universal one-sided approach that supports all partitionings and replication factors. The method uses slicing to enumerate local tile multiplies, which can be executed directly or lowered to an optimized IR, and is implemented in a C++ PGAS framework with direct GPU-to-GPU communication. It demonstrates competitive performance against PyTorch DTensor across GPT-like MLP workloads and a broad set of partitionings, supporting the practicality of a single algorithm for diverse distributions. By reducing implementation burden and improving communication–computation overlap, the approach enables broader design-space exploration, with potential impact on AI training and large-scale scientific computing.
Abstract
Many important applications in science, data analytics, and AI depend on distributed matrix multiplication. Prior work has developed a large array of algorithms suited to different problem sizes and partitionings, including 1D, 2D, 1.5D, and 2.5D algorithms. However, each existing algorithm supports only a subset of partitionings, so multiple algorithm implementations are required to cover the full space of possible partitionings. If no implementation is available for a particular combination of partitionings, one or more operands must be redistributed, increasing communication costs. This paper presents a universal one-sided algorithm for distributed matrix multiplication that supports all combinations of partitionings and replication factors. Our algorithm uses slicing (index arithmetic) to compute the sets of overlapping tiles that must be multiplied together. The resulting list of local matrix multiplies can then either be executed directly, or reordered and lowered to an optimized IR to maximize overlap. We implement our algorithm using a high-level C++-based PGAS programming framework that performs direct GPU-to-GPU communication using intra-node interconnects. We evaluate performance for a wide variety of partitionings and replication factors, finding that our work is competitive with PyTorch DTensor, a highly optimized distributed tensor library targeting AI models.
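
To make the slicing idea concrete, the following is a minimal single-process sketch in Python, not the paper's C++ PGAS implementation. It assumes each matrix is split into a regular grid of rectangular tiles and ignores tile ownership, replication factors, and communication entirely. The names `tile_bounds`, `intersect`, and `local_multiplies` are hypothetical helpers introduced only for illustration. For C = A B with independently chosen tile shapes, it uses interval intersection to list every triple of overlapping tiles and the global index ranges each local multiply would cover.

```python
from itertools import product


def tile_bounds(extent, tile):
    """Split [0, extent) into half-open intervals of length `tile`.

    The last interval may be shorter when `tile` does not divide `extent`.
    """
    return [(lo, min(lo + tile, extent)) for lo in range(0, extent, tile)]


def intersect(a, b):
    """Return the intersection of two half-open intervals, or None if empty."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None


def local_multiplies(m, n, k, a_tile, b_tile, c_tile):
    """Enumerate the local tile multiplies needed for C (m x n) = A (m x k) @ B (k x n).

    Each *_tile argument is a (rows, cols) tile shape (an assumption of this
    sketch: regular tile grids). Every returned entry names one C tile, one
    A tile, one B tile, and the global i/j/k ranges where all three overlap.
    """
    a_rows, a_cols = tile_bounds(m, a_tile[0]), tile_bounds(k, a_tile[1])
    b_rows, b_cols = tile_bounds(k, b_tile[0]), tile_bounds(n, b_tile[1])
    c_rows, c_cols = tile_bounds(m, c_tile[0]), tile_bounds(n, c_tile[1])

    ops = []
    for (ci, c_r), (cj, c_c) in product(enumerate(c_rows), enumerate(c_cols)):
        for (ai, a_r), (ak, a_c) in product(enumerate(a_rows), enumerate(a_cols)):
            # Rows of the A tile must overlap rows of the C tile.
            i_rng = intersect(c_r, a_r)
            if i_rng is None:
                continue
            for (bk, b_r), (bj, b_c) in product(enumerate(b_rows), enumerate(b_cols)):
                # Columns of the B tile must overlap columns of the C tile,
                # and the inner (k) ranges of the A and B tiles must overlap.
                j_rng = intersect(c_c, b_c)
                k_rng = intersect(a_c, b_r)
                if j_rng is None or k_rng is None:
                    continue
                ops.append({
                    "c": (ci, cj), "a": (ai, ak), "b": (bk, bj),
                    "i": i_rng, "j": j_rng, "k": k_rng,
                })
    return ops


if __name__ == "__main__":
    # Mismatched tilings: A in 4x3 tiles, B in 2x5 tiles, C in 3x2 tiles.
    for op in local_multiplies(6, 5, 7, (4, 3), (2, 5), (3, 2))[:4]:
        print(op)
```

Because the tile grids partition each matrix, every scalar product A[i,k]·B[k,j] falls in exactly one listed entry, so executing the list in any order and accumulating into C gives the full product. In a distributed setting, one would presumably filter this list by which process owns each tile and then reorder it to overlap communication with computation; that step is outside the scope of this sketch.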
