FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

Li-Wen Chang; Wenlei Bao; Qi Hou; Chengquan Jiang; Ningxin Zheng; Yinmin Zhong; Xuanrun Zhang; Zuquan Song; Chengji Yao; Ziheng Jiang; Haibin Lin; Xin Jin; Xin Liu

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu

TL;DR

Flux tackles the communication bottleneck in tensor-parallel GPU workloads by over-decomposing and fusing communication with GEMM into a single kernel. By implementing ReduceScatter and AllGather as epilogue and prologue fused operations within a GEMM kernel and employing tile coordinate swizzling and auto-tuning, Flux achieves substantial overlap and improved compute utilization across diverse GPU interconnects. Empirical results show notable speedups for both training and inference on 128-GPU and 8-GPU clusters, with higher gains when communication proportions are large. The approach offers a practical path to scalable, high-performance tensor parallelism for large deep learning models.

Abstract

Large deep learning models have demonstrated strong ability to solve many tasks across a wide range of applications. Those large models typically require training and inference to be distributed. Tensor parallelism is a common technique partitioning computation of an operation or layer across devices to overcome the memory capacity limitation of a single processor, and/or to accelerate computation to meet a certain latency requirement. However, this kind of parallelism introduces additional communication that might contribute a significant portion of overall runtime. Thus limits scalability of this technique within a group of devices with high speed interconnects, such as GPUs with NVLinks in a node. This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations for GPUs. Flux over-decomposes communication and computation operations into much finer-grained operations and further fuses them into a larger kernel to effectively hide communication without compromising kernel efficiency. Flux can potentially overlap up to 96% of communication given a fused kernel. Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects, and up to 1.66x and 1.30x speedups for prefill and decoding inference over vLLM on a cluster with 8 GPUs with various GPU generations and interconnects.

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

TL;DR

Abstract

Paper Structure (21 sections, 2 equations, 17 figures, 3 algorithms)

This paper contains 21 sections, 2 equations, 17 figures, 3 algorithms.

Introduction
Backgrounds on Tensor Parallelism and Overlapping
Common Partitioning and Communication Patterns
Conventional Communication Overlapping Strategies
Effective Communication Time and Overlapping Efficiency
Overview of Fused GEMM with Communication
ReduceScatter Overlapping
AllGather Overlapping
Comparison among Decomposition Strategies
Optimizations and Implementation Details
Tile Coordinate Swizzling
Implementation Details of ReduceScatter
Implementation Details of AllGather
GEMM Implementation and Auto-Tuning
Evaluation
...and 6 more sections

Figures (17)

Figure 1: Non-overlapped communication portion within tensor parallelism in common LLM workloads for training with 2-way data, 8-way pipeline, 8-way tensor parallelism on various 128-GPU clusters, and inference with 8-way tensor parallelism on various 8-GPU clusters.
Figure 2: Forward-propagation of the MLP portion with a N-way partitioning across N devices. Here, B and L are flattened to fit the common notation of GEMM.
Figure 3: An illustration of the prior GEMM-ReduceScatter overlapping with 2-way tensor parallelism.
Figure 4: Performance between PyTorch (non-overlapping) and TransformerEngine (prior overlapping method) from m = 1024 to 8192, with (n, k) as (49152, 12288) and (12288, 49152) in AllGather and ReduceScatter on an 8-H800 cluster with NVLink interconnections.
Figure 5: An illustration of differences among the non-overlapping and different overlapping methods in a GEMM-ReduceScatter pattern with 2-way tensor parallelism.
...and 12 more figures

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

TL;DR

Abstract

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

Authors

TL;DR

Abstract

Table of Contents

Figures (17)