Table of Contents
Fetching ...

Optimizing Distributed ML Communication with Fused Computation-Collective Operations

Kishore Punniyamurthy, Khaled Hamidouche, Bradford M. Beckmann

TL;DR

This paper addresses the latency bottleneck of collective communications in distributed ML by fusing dependent computation with network transfers inside persistent GPU kernels, enabling fine-grained overlap between computation and communication. It introduces three fused operators—Embedding+All-to-All, GEMV+AllReduce, and GEMM+All-to-All—with scale-out and scale-up variants, implemented as PyTorch operators and extended Triton primitives to support GPU-initiated networking. Empirical results show meaningful reductions in execution time: approximately 20% on average for scale-up embedding+All-to-All, 13% for GEMV+AllReduce, and 12% for GEMM+All-to-All, with inter-node improvements reaching around 31% on average for embedding+All-to-All; large-scale simulations indicate up to 21% total training-time reduction on 128-node DLRM. The approach leverages GPU-initiated intra-kernel communication, cache-flush features, and zero-copy data paths to minimize buffering and kernel-launch overhead, and demonstrates practical integration with PyTorch and Triton to enable broader adoption in production ML workloads.

Abstract

In order to satisfy their ever increasing capacity and compute requirements, machine learning models are distributed across multiple nodes using numerous parallelism strategies. As a result, collective communications are often on the critical path, and hiding their latency by overlapping kernel-granular communication and computation is difficult due to the absence of independent computation. In this work, we propose fusing computation with dependent collective communication by leveraging GPUs' massive parallelism and GPU-initiated communication. We have developed self-contained GPU kernels where workgroups (WGs) immediately communicate their results to remote GPUs when they complete their computation. Meanwhile, other WGs within the same kernel perform overlapping computation, maintaining high ALU utilization. We demonstrate our approach by creating three prototype fused operators (embedding + All-to-All, GEMV + AllReduce, and GEMM + All-to-All) to address the pervasive communication overheads observed in DLRM, Transformers and MoE model architectures. In order to demonstrate that our approach can be integrated into ML frameworks for wide adoption in production environments, we expose our fused operators as new PyTorch operators as well as extend the Triton framework to enable them. Our evaluations show that our approach can effectively overlap communication with computations, subsequently reducing their combined execution time than the current collective library-based approaches. Our scale-up GEMV + AllReduce and GEMM + All-to-All implementations achieve up to 22% and 20% lower execution time, while our fused embedding + All-to-All reduces execution time by 20% and 31% for intra-node and inter-node configurations. Large scale-out simulations indicate that our approach reduces DLRM execution time by 21% for 128 node system.

Optimizing Distributed ML Communication with Fused Computation-Collective Operations

TL;DR

This paper addresses the latency bottleneck of collective communications in distributed ML by fusing dependent computation with network transfers inside persistent GPU kernels, enabling fine-grained overlap between computation and communication. It introduces three fused operators—Embedding+All-to-All, GEMV+AllReduce, and GEMM+All-to-All—with scale-out and scale-up variants, implemented as PyTorch operators and extended Triton primitives to support GPU-initiated networking. Empirical results show meaningful reductions in execution time: approximately 20% on average for scale-up embedding+All-to-All, 13% for GEMV+AllReduce, and 12% for GEMM+All-to-All, with inter-node improvements reaching around 31% on average for embedding+All-to-All; large-scale simulations indicate up to 21% total training-time reduction on 128-node DLRM. The approach leverages GPU-initiated intra-kernel communication, cache-flush features, and zero-copy data paths to minimize buffering and kernel-launch overhead, and demonstrates practical integration with PyTorch and Triton to enable broader adoption in production ML workloads.

Abstract

In order to satisfy their ever increasing capacity and compute requirements, machine learning models are distributed across multiple nodes using numerous parallelism strategies. As a result, collective communications are often on the critical path, and hiding their latency by overlapping kernel-granular communication and computation is difficult due to the absence of independent computation. In this work, we propose fusing computation with dependent collective communication by leveraging GPUs' massive parallelism and GPU-initiated communication. We have developed self-contained GPU kernels where workgroups (WGs) immediately communicate their results to remote GPUs when they complete their computation. Meanwhile, other WGs within the same kernel perform overlapping computation, maintaining high ALU utilization. We demonstrate our approach by creating three prototype fused operators (embedding + All-to-All, GEMV + AllReduce, and GEMM + All-to-All) to address the pervasive communication overheads observed in DLRM, Transformers and MoE model architectures. In order to demonstrate that our approach can be integrated into ML frameworks for wide adoption in production environments, we expose our fused operators as new PyTorch operators as well as extend the Triton framework to enable them. Our evaluations show that our approach can effectively overlap communication with computations, subsequently reducing their combined execution time than the current collective library-based approaches. Our scale-up GEMV + AllReduce and GEMM + All-to-All implementations achieve up to 22% and 20% lower execution time, while our fused embedding + All-to-All reduces execution time by 20% and 31% for intra-node and inter-node configurations. Large scale-out simulations indicate that our approach reduces DLRM execution time by 21% for 128 node system.
Paper Structure (16 sections, 15 figures, 2 tables)

This paper contains 16 sections, 15 figures, 2 tables.

Figures (15)

  • Figure 1: System architecture trends.
  • Figure 2: DLRM forward pass.
  • Figure 3: Model parallelism in Transformer MLP layer megatronlm.
  • Figure 4: MoE layers distributed across GPUs tutel.
  • Figure 5: Kernel boundary vs intra-kernel communication
  • ...and 10 more figures