gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters

Jiajun Huang; Sheng Di; Xiaodong Yu; Yujia Zhai; Jinyang Liu; Yafan Huang; Ken Raffenetti; Hui Zhou; Kai Zhao; Xiaoyi Lu; Zizhong Chen; Franck Cappello; Yanfei Guo; Rajeev Thakur

TL;DR

The paper addresses the bottleneck of large-message GPU-aware collective communication by integrating accuracy-aware lossy compression into a GPU-centric design. It introduces gZCCL, a general framework that combines two algorithm design frameworks (one for collective computation, one for collective data movement) with two performance optimizations to maximize GPU utilization while bounding error propagation. Experiments on up to 512 NVIDIA A100 GPUs show speedups of up to 4.5X over NCCL and up to 28.7X over Cray MPI for Allreduce and Scatter, and a real-world image-stacking application confirms that reconstructed data quality remains high under compression. The work provides a practical path to scalable, compression-assisted collectives and sets the stage for broader hardware integration in exascale environments.

Abstract

GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. A traditional approach is to directly integrate lossy compression into GPU-aware collectives, which can lead to serious performance issues such as underutilized GPU devices and uncontrolled data distortion. To address these issues, we propose gZCCL, the first general framework that designs and optimizes GPU-aware, compression-enabled collectives with an accuracy-aware design to control error propagation. To validate our framework, we evaluate its performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our gZCCL-accelerated collectives, including both collective computation (Allreduce) and collective data movement (Scatter), can outperform NCCL as well as Cray MPI by up to 4.5X and 28.7X, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high reconstructed data quality of our accuracy-aware framework.
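To make the idea of a compression-enabled collective with controlled error propagation concrete, the following is a minimal CPU-side sketch, not gZCCL's actual algorithm. It simulates a ring Allreduce in which every message is passed through a toy error-bounded quantizer. The functions `compress`, `decompress`, and `ring_allreduce_compressed` are hypothetical names introduced here, and the quantizer is only a stand-in for a real GPU lossy compressor such as cuSZp. Under these assumptions, each of the n - 1 reduce-scatter hops adds at most `eb` of error and the single compression before the allgather adds at most `eb` more, so the result deviates from the exact sum by at most about n * eb.

```python
import numpy as np

def compress(x, eb):
    """Toy error-bounded quantizer (assumption, not cuSZp): |x - x'| <= eb."""
    return np.round(x / (2.0 * eb)).astype(np.int64)

def decompress(q, eb):
    """Inverse of the toy quantizer above."""
    return q.astype(np.float64) * (2.0 * eb)

def ring_allreduce_compressed(inputs, eb):
    """Simulate a ring Allreduce (sum) over len(inputs) ranks with lossy messages."""
    n = len(inputs)
    # Each rank copies its buffer and splits it into n chunks.
    chunks = [np.array_split(np.array(v, dtype=np.float64), n) for v in inputs]

    # Reduce-scatter: n - 1 steps, each rank sends one compressed chunk to rank r + 1.
    for step in range(n - 1):
        # Build all messages first so the exchange behaves as if simultaneous.
        msgs = []
        for r in range(n):
            idx = (r - step) % n
            msgs.append((idx, compress(chunks[r][idx], eb)))
        for r in range(n):
            idx, payload = msgs[(r - 1) % n]
            chunks[r][idx] = chunks[r][idx] + decompress(payload, eb)

    # Rank r now owns the fully reduced chunk (r + 1) % n.
    # Allgather: compress once at the owner and forward the same payload unchanged,
    # so this phase adds only one more bounded error and all ranks agree exactly.
    owned = [compress(chunks[r][(r + 1) % n], eb) for r in range(n)]
    results = []
    for _ in range(n):
        out = [None] * n
        for owner in range(n):
            out[(owner + 1) % n] = decompress(owned[owner], eb)
        results.append(np.concatenate(out))
    return results

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, eb = 4, 1e-3
    data = [rng.standard_normal(1000) for _ in range(n)]
    out = ring_allreduce_compressed(data, eb)
    err = np.max(np.abs(out[0] - np.sum(data, axis=0)))
    print(f"max abs error = {err:.2e}, bound = {n * eb:.2e}")
```

The sketch compresses partial sums at every hop, which is the naive pattern in which error can accumulate with the number of ranks; it illustrates why an accuracy-aware design matters rather than how gZCCL achieves it.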

Paper Structure (26 sections, 13 figures, 2 tables)

Introduction
Background and Related Work
gZCCL Design and Optimization
Analysis of existing compression-enabled GPU-aware collectives
Inefficient prior solutions in GPU-aware collectives
Identification of the bottlenecks in prior related works
Characterization of ring-based compression-enabled GPU-aware collectives
Traditional ring-based algorithms for long messages
Characterization of GPU lossy compressor
Ring-based collective computation
Proposing the novel gZCCL framework
Getting rid of the traditional host-centric design
Adapting lossy compression to achieve high collective performance
Two algorithm design frameworks
Two performance optimization frameworks
...and 11 more sections

Figures (13)

Figure 1: gZCCL design architecture.
Figure 2: Performance breakdown of Allreduce using CPRP2P and C-Coll: CPRP2P's first percentage is scaled to C-Coll's runtime, and the second is scaled to its own.
Figure 3: Characterization of cuSZp compression and decompression execution time with uniform data.
Figure 4: Design of our gZCCL collective computation framework on compression-accelerated gZ-Allreduce. This example uses four GPUs/Processes.
Figure 5: Design of our gZCCL data movement framework on compression-accelerated gZ-Scatter. This example uses four GPUs/Processes.
...and 8 more figures
