ZCCL: Significantly Improving Collective Communication With Error-Bounded Lossy Compression

Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Zhaorui Zhang, Jinyang Liu, Xiaoyi Lu, Ken Raffenetti, Hui Zhou, Kai Zhao, Khalid Alharthi, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur

TL;DR

ZCCL uses error-bounded lossy compression to accelerate MPI collectives, introducing two optimization frameworks tailored to collective data movement and collective computation. ZCCL selects fZ-light as the best-performing compressor among high-speed options and masks communication cost by overlapping or hiding compression steps, all while maintaining bounded error guarantees. Experimental results across Allgather, Allreduce, Scatter, Broadcast, and image stacking show speedups of 1.9–8.9× over baselines and strong scalability on a 128-node HPC cluster. The work demonstrates generalizability to multiple collectives and lays groundwork for deploying error-bounded lossy compression in diverse HPC workloads, with future plans to support more collectives and accelerator platforms.

Abstract

With the ever-increasing computing power of supercomputers and the growing scale of scientific applications, the efficiency of MPI collective communication has become a critical bottleneck in large-scale distributed and parallel processing. The large message size in MPI collectives is particularly concerning because it can significantly degrade overall parallel performance. To address this issue, prior research simply applies off-the-shelf fixed-rate lossy compressors in the MPI collectives, leading to suboptimal performance, limited generalizability, and unbounded errors. In this paper, we propose a novel solution, called ZCCL, which leverages error-bounded lossy compression to significantly reduce the message size, resulting in a substantial reduction in communication costs. The key contributions are three-fold. (1) We develop two general, optimized lossy-compression-based frameworks for both types of MPI collectives (collective data movement as well as collective computation), based on their particular characteristics. Our frameworks not only reduce communication costs but also preserve data accuracy. (2) We customize fZ-light, an ultra-fast error-bounded lossy compressor, to meet the specific needs of collective communication. (3) We integrate ZCCL into multiple collectives, such as Allgather, Allreduce, Scatter, and Broadcast, and perform a comprehensive evaluation based on real-world scientific application datasets. Experiments show that our solution outperforms the original MPI collectives as well as multiple baselines by 1.9–8.9×.

Paper Structure

This paper contains 32 sections, 4 theorems, 4 equations, 16 figures, 7 tables.

Key Result

Theorem 1

The final aggregated error for the Sum operation falls into the interval $[-2\sqrt{n}\sigma, 2\sqrt{n}\sigma]$ with a probability of $95.44\%$, where $n$ is the number of computing nodes in MPI and $\sigma$ is the standard deviation of the compression error introduced by the lossy compressor.
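
The bound in Theorem 1 can be checked numerically. The following is a minimal sketch, not the paper's proof: it assumes that each node's compression error is an independent, zero-mean, normally distributed value and that $\sigma$ is read as the standard deviation of that error. Under those assumptions the sum of $n$ errors has standard deviation $\sqrt{n}\sigma$, so the interval is the usual two-sigma range. The function names and parameters are illustrative only.

```python
import math
import random


def exact_two_sigma_probability():
    """P(|Z| <= 2) for a standard normal Z, computed as erf(2 / sqrt(2))."""
    return math.erf(2.0 / math.sqrt(2.0))


def fraction_within_bound(n, sigma, trials=20000, seed=0):
    """Monte Carlo estimate of P(|sum of n errors| <= 2 * sqrt(n) * sigma).

    Assumption: each error is drawn independently from N(0, sigma^2).
    """
    rng = random.Random(seed)
    bound = 2.0 * math.sqrt(n) * sigma
    hits = 0
    for _ in range(trials):
        # Aggregated error of a Sum reduction over n nodes.
        total = sum(rng.gauss(0.0, sigma) for _ in range(n))
        if abs(total) <= bound:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print("exact two-sigma probability:", exact_two_sigma_probability())
    print("simulated, n=16:", fraction_within_bound(n=16, sigma=0.5))
```

The simulated fraction should sit close to the closed-form value for any choice of `n` and `sigma`, since both sides of the inequality scale the same way. If the errors were correlated or not centered at zero, this sketch would no longer apply.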

Figures (16)

  • Figure 1: Design architecture (yellow box: applications; green box: newly contributed modules; purple box: third-party).
  • Figure 2: High-level design of our collective data movement framework in the ring-based allgather algorithm to mitigate compression error propagation. $A$ means the original data and $A_c$ means the compressed data. This rule applies to other data chunks as well. This algorithm completes in $N-1$ rounds, where $N$ is the number of processes.
  • Figure 3: High-level design of our collective data movement framework in the binomial tree broadcast algorithm. It completes in $\log_2 N$ rounds, where $N$ is the number of processes.
  • Figure 4: High-level design of our collective computation framework in the ring-based reduce-scatter algorithm.
  • Figure 5: Exemplifying the normal distribution property of compression errors.
  • ...and 11 more figures
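
The idea behind Figure 2 can be sketched as a small single-process simulation. This is an illustration under stated assumptions, not the ZCCL implementation: the `quantize`/`dequantize` pair is a toy stand-in for an error-bounded compressor such as fZ-light, the ring is simulated with Python lists rather than MPI calls, and every chunk (including a process's own) is decompressed at the end for simplicity. Each chunk is compressed once by its owner and forwarded in compressed form, so the error does not accumulate across the $N-1$ rounds.

```python
def quantize(values, eb):
    """Toy error-bounded 'compressor' (hypothetical): map each value to an
    integer code so that the reconstruction differs by at most eb."""
    return [round(v / (2 * eb)) for v in values]


def dequantize(codes, eb):
    """Inverse of quantize: rebuild approximate values from integer codes."""
    return [c * 2 * eb for c in codes]


def ring_allgather_compressed(chunks, eb):
    """Simulate a ring allgather in which every chunk is compressed once.

    chunks[p] is the data owned by process p. Returns (gathered, rounds),
    where gathered[p][i] is process p's reconstructed copy of chunk i.
    """
    n = len(chunks)
    # Each process compresses its own chunk exactly once (A -> A_c).
    compressed = [quantize(c, eb) for c in chunks]
    # buffers[p] maps chunk index -> compressed chunk held by process p.
    buffers = [{p: compressed[p]} for p in range(n)]
    rounds = 0
    for step in range(n - 1):
        # In round `step`, process p forwards chunk (p - step) mod n,
        # still compressed, to its right neighbour (p + 1) mod n.
        sends = [((p + 1) % n, (p - step) % n) for p in range(n)]
        payloads = [buffers[p][idx] for p, (_, idx) in enumerate(sends)]
        for (dest, idx), payload in zip(sends, payloads):
            buffers[dest][idx] = payload
        rounds += 1
    # Decompress only after all rounds have finished.
    gathered = [[dequantize(buffers[p][i], eb) for i in range(n)]
                for p in range(n)]
    return gathered, rounds


if __name__ == "__main__":
    data = [[0.123, 4.567], [1.0, -2.5], [3.14159, 2.71828], [-0.001, 100.0]]
    result, rounds = ring_allgather_compressed(data, eb=0.01)
    print("rounds:", rounds)
    print("process 0 view:", result[0])
```

Because no chunk is decompressed and recompressed along the way, every reconstructed value stays within the single error bound `eb` of the original, regardless of how many hops it travelled.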

Theorems & Definitions (6)

  • Theorem 1
  • proof
  • Corollary 1
  • Corollary 2
  • Theorem 2
  • proof