gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters

Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Jinyang Liu, Yafan Huang, Ken Raffenetti, Hui Zhou, Kai Zhao, Xiaoyi Lu, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur

TL;DR

The paper addresses the large-message bottleneck in GPU-aware collective communication by integrating accuracy-aware lossy compression into a GPU-centric design. It introduces gZCCL, a general framework comprising two algorithm design frameworks and two optimization pipelines that maximize GPU utilization while bounding error propagation. Experiments on up to 512 NVIDIA A100 GPUs show speedups of up to 4.5X over NCCL and up to 28.7X over Cray MPI across Allreduce and Scatter, and a real-world image-stacking application confirms that data quality remains high under compression. The work offers a practical path to scalable, compression-assisted collectives and sets the stage for broader hardware integration in exascale environments.

Abstract

GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. The traditional approach of directly integrating lossy compression into GPU-aware collectives can lead to serious performance issues such as underutilized GPU devices and uncontrolled data distortion. To address these issues, we propose gZCCL, the first general framework that designs and optimizes GPU-aware, compression-enabled collectives with an accuracy-aware design to control error propagation. To validate our framework, we evaluate its performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our gZCCL-accelerated collectives, including both collective computation (Allreduce) and collective data movement (Scatter), can outperform NCCL as well as Cray MPI by up to 4.5X and 28.7X, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high reconstructed data quality of our accuracy-aware framework.
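To make the idea of controlling error propagation concrete, the following is a minimal CPU-side sketch, not gZCCL's algorithm and not the cuSZp compressor. It assumes a toy uniform quantizer with an absolute error bound `eb` (the names `quantize`, `dequantize`, and `compressed_allreduce_sum` are hypothetical) and assumes each rank's contribution is compressed exactly once before being summed. Under those assumptions, the error of the reduced result is bounded by the number of ranks times `eb`.

```python
import random

def quantize(values, error_bound):
    # Toy error-bounded quantizer (assumption): bins of width 2*eb,
    # so each reconstructed value is within eb of the original.
    return [round(v / (2 * error_bound)) for v in values]

def dequantize(codes, error_bound):
    # Reconstruct each value as the centre of its bin.
    return [c * 2 * error_bound for c in codes]

def compressed_allreduce_sum(per_rank_data, error_bound):
    # Sum the vectors of all ranks, where every contribution passes
    # through one lossy compress/decompress round trip (assumption).
    total = [0.0] * len(per_rank_data[0])
    for data in per_rank_data:
        recon = dequantize(quantize(data, error_bound), error_bound)
        total = [t + r for t, r in zip(total, recon)]
    return total

random.seed(0)
ranks, n, eb = 4, 1000, 1e-3
data = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(ranks)]

exact = [sum(column) for column in zip(*data)]
approx = compressed_allreduce_sum(data, eb)
max_err = max(abs(a - b) for a, b in zip(exact, approx))

# Per-element errors add up across ranks, so the worst case is ranks * eb.
print(f"max error = {max_err:.6f}, bound = {ranks * eb:.6f}")
```

In a collective that compresses intermediate results again at each communication step, the accumulated error would depend on the number of compression steps rather than only on the number of ranks, which is why the choice of collective algorithm matters for accuracy.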

Paper Structure (26 sections, 13 figures, 2 tables)

This paper contains 26 sections, 13 figures, 2 tables.

Figures (13)

  • Figure 1: gZCCL design architecture.
  • Figure 2: Performance breakdown of Allreduce using CPRP2P and C-Coll: CPRP2P's first percentage is scaled to C-Coll's runtime, and the second is scaled to its own.
  • Figure 3: Characterization of cuSZp compression and decompression execution time with uniform data.
  • Figure 4: Design of our gZCCL collective computation framework on compression-accelerated gZ-Allreduce. This example uses four GPUs/Processes.
  • Figure 5: Design of our gZCCL data movement framework on compression-accelerated gZ-Scatter. This example uses four GPUs/Processes.
  • ...and 8 more figures