Table of Contents
Fetching ...

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu

TL;DR

Flux tackles the communication bottleneck in tensor-parallel GPU workloads by over-decomposing and fusing communication with GEMM into a single kernel. By implementing ReduceScatter and AllGather as epilogue and prologue fused operations within a GEMM kernel and employing tile coordinate swizzling and auto-tuning, Flux achieves substantial overlap and improved compute utilization across diverse GPU interconnects. Empirical results show notable speedups for both training and inference on 128-GPU and 8-GPU clusters, with higher gains when communication proportions are large. The approach offers a practical path to scalable, high-performance tensor parallelism for large deep learning models.

Abstract

Large deep learning models have demonstrated strong ability to solve many tasks across a wide range of applications. Those large models typically require training and inference to be distributed. Tensor parallelism is a common technique partitioning computation of an operation or layer across devices to overcome the memory capacity limitation of a single processor, and/or to accelerate computation to meet a certain latency requirement. However, this kind of parallelism introduces additional communication that might contribute a significant portion of overall runtime. Thus limits scalability of this technique within a group of devices with high speed interconnects, such as GPUs with NVLinks in a node. This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations for GPUs. Flux over-decomposes communication and computation operations into much finer-grained operations and further fuses them into a larger kernel to effectively hide communication without compromising kernel efficiency. Flux can potentially overlap up to 96% of communication given a fused kernel. Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects, and up to 1.66x and 1.30x speedups for prefill and decoding inference over vLLM on a cluster with 8 GPUs with various GPU generations and interconnects.

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

TL;DR

Flux tackles the communication bottleneck in tensor-parallel GPU workloads by over-decomposing and fusing communication with GEMM into a single kernel. By implementing ReduceScatter and AllGather as epilogue and prologue fused operations within a GEMM kernel and employing tile coordinate swizzling and auto-tuning, Flux achieves substantial overlap and improved compute utilization across diverse GPU interconnects. Empirical results show notable speedups for both training and inference on 128-GPU and 8-GPU clusters, with higher gains when communication proportions are large. The approach offers a practical path to scalable, high-performance tensor parallelism for large deep learning models.

Abstract

Large deep learning models have demonstrated strong ability to solve many tasks across a wide range of applications. Those large models typically require training and inference to be distributed. Tensor parallelism is a common technique partitioning computation of an operation or layer across devices to overcome the memory capacity limitation of a single processor, and/or to accelerate computation to meet a certain latency requirement. However, this kind of parallelism introduces additional communication that might contribute a significant portion of overall runtime. Thus limits scalability of this technique within a group of devices with high speed interconnects, such as GPUs with NVLinks in a node. This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations for GPUs. Flux over-decomposes communication and computation operations into much finer-grained operations and further fuses them into a larger kernel to effectively hide communication without compromising kernel efficiency. Flux can potentially overlap up to 96% of communication given a fused kernel. Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects, and up to 1.66x and 1.30x speedups for prefill and decoding inference over vLLM on a cluster with 8 GPUs with various GPU generations and interconnects.
Paper Structure (21 sections, 2 equations, 17 figures, 3 algorithms)

This paper contains 21 sections, 2 equations, 17 figures, 3 algorithms.

Figures (17)

  • Figure 1: Non-overlapped communication portion within tensor parallelism in common LLM workloads for training with 2-way data, 8-way pipeline, 8-way tensor parallelism on various 128-GPU clusters, and inference with 8-way tensor parallelism on various 8-GPU clusters.
  • Figure 2: Forward-propagation of the MLP portion with a N-way partitioning across N devices. Here, B and L are flattened to fit the common notation of GEMM.
  • Figure 3: An illustration of the prior GEMM-ReduceScatter overlapping with 2-way tensor parallelism.
  • Figure 4: Performance between PyTorch (non-overlapping) and TransformerEngine (prior overlapping method) from m = 1024 to 8192, with (n, k) as (49152, 12288) and (12288, 49152) in AllGather and ReduceScatter on an 8-H800 cluster with NVLink interconnections.
  • Figure 5: An illustration of differences among the non-overlapping and different overlapping methods in a GEMM-ReduceScatter pattern with 2-way tensor parallelism.
  • ...and 12 more figures