An Optimized Error-controlled MPI Collective Framework Integrated with Lossy Compression

Jiajun Huang; Sheng Di; Xiaodong Yu; Yujia Zhai; Zhaorui Zhang; Jinyang Liu; Xiaoyi Lu; Ken Raffenetti; Hui Zhou; Kai Zhao; Zizhong Chen; Franck Cappello; Yanfei Guo; Rajeev Thakur

TL;DR

This work addresses the bottleneck of large-message MPI collectives by introducing C-Coll, a general framework that uses error-bounded lossy compression to reduce message sizes while keeping errors bounded. It comprises two novel frameworks, one tailored to collective data movement and one to collective computation, plus a customized PIPE-SZx compressor, enabling practical integration into MPI primitives such as Allreduce, Scatter, and Bcast. The authors provide a theoretical analysis of error propagation, select SZx as the best-performing compressor for this context, and report substantial performance gains (1.8-2.7× over the original MPI collectives and other baselines) on real-world scientific datasets. The approach generalizes across multiple collectives, demonstrates robust accuracy for image stacking, and offers a foundation for deploying compression-enabled collectives on broader HPC systems and future architectures.

Abstract

With the ever-increasing computing power of supercomputers and the growing scale of scientific applications, the efficiency of MPI collective communications has become a critical bottleneck in large-scale distributed and parallel processing. The large message size in MPI collectives is particularly concerning because it can significantly degrade the overall parallel performance. To address this issue, prior research simply applies off-the-shelf fixed-rate lossy compressors in MPI collectives, leading to suboptimal performance, limited generalizability, and unbounded errors. In this paper, we propose a novel solution, called C-Coll, which leverages error-bounded lossy compression to significantly reduce the message size, resulting in a substantial reduction in communication cost. The key contributions are three-fold. (1) We develop two general, optimized lossy-compression-based frameworks for both types of MPI collectives (collective data movement as well as collective computation), based on their particular characteristics. Our frameworks not only reduce communication cost but also preserve data accuracy. (2) We customize SZx, an ultra-fast error-bounded lossy compressor, to meet the specific needs of collective communication. (3) We integrate C-Coll into multiple collectives, such as MPI_Allreduce, MPI_Scatter, and MPI_Bcast, and perform a comprehensive evaluation based on real-world scientific datasets. Experiments show that our solution outperforms the original MPI collectives as well as multiple baselines and related efforts by 1.8-2.7×.

Paper Structure (30 sections, 4 theorems, 4 equations, 18 figures, 6 tables)

Introduction
Background and Related Work
MPI Collective Communication
Collective data movement
Collective computation
High-speed Lossy Compressors
Lossy Compression-enabled MPI Implementations
C-Coll Design and Optimization
Two Proposed Novel Frameworks for Compression-enhanced Collectives
Collective data movement framework
Collective computation framework
Theoretical Analysis of Error Propagation in C-Coll
Identify Best-qualified High-speed Error-bounded Lossy Compressor
Characterization of Performance Bottlenecks
Step-wise Optimizations
...and 15 more sections

Key Result

Theorem 1

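Based on the above analysis, the final aggregated error for the Sum operation falls into the interval $[-2\sqrt{n}\sigma, 2\sqrt{n}\sigma]$ with a probability of $95.44\%$, where $n$ is the number of computing nodes in MPI and $\sigma$ is the standard deviation of the compression error introduced by the lossy compressor. The interval is the two-standard-deviation range of a normal distribution, since the sum of $n$ independent errors with standard deviation $\sigma$ has standard deviation $\sqrt{n}\sigma$.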
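The bound in Theorem 1 can be checked numerically. The following is a minimal sketch, not the authors' analysis: it assumes each node's compression error is an independent, zero-mean normal variable, and it reads $\sigma$ as the standard deviation of that error. The function names, trial count, and seed are hypothetical choices made for illustration.

```python
import math
import random


def coverage_of_two_sigma_interval(n_nodes, sigma, trials=20000, seed=0):
    """Estimate P(|sum of n_nodes errors| <= 2 * sqrt(n_nodes) * sigma).

    Assumption (not from the paper): per-node errors are independent
    draws from a normal distribution with mean 0 and std sigma.
    """
    rng = random.Random(seed)
    bound = 2.0 * math.sqrt(n_nodes) * sigma
    hits = 0
    for _ in range(trials):
        # Aggregated error of a Sum reduction over n_nodes contributions.
        total = sum(rng.gauss(0.0, sigma) for _ in range(n_nodes))
        if abs(total) <= bound:
            hits += 1
    return hits / trials


def exact_two_sigma_probability():
    """Probability mass of a normal distribution within two std deviations."""
    return math.erf(2.0 / math.sqrt(2.0))


if __name__ == "__main__":
    print("closed form:", exact_two_sigma_probability())
    print("simulated  :", coverage_of_two_sigma_interval(n_nodes=16, sigma=1e-3))
```

Under these assumptions the simulated coverage should land close to the closed-form value of roughly 0.9545, independent of the choice of `n_nodes` and `sigma`.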

Figures (18)

Figure 1: Design architecture (yellow box: applications; green box: new contributed modules; purple box: third-party).
Figure 2: High-level design of our collective data movement framework in the ring-based allgather algorithm to mitigate compression error propagation. $A$ means the original data and $A_c$ means the compressed data. This rule applies to other data chunks as well. This algorithm completes in $N-1$ rounds, where $N$ is the number of processes.
Figure 3: High-level design of our collective data movement framework in the binomial tree broadcast algorithm. It completes in $\log_2 N$ rounds, where $N$ is the number of processes.
Figure 4: High-level design of our collective computation framework in the ring-based reduce-scatter algorithm.
Figure 5: Exemplifying the normal distribution property of compression errors.
...and 13 more figures
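
The round counts quoted in the captions of Figures 2 and 3 can be reproduced with a small simulation. This is an illustrative sketch of the textbook ring allgather and binomial tree broadcast patterns only. It does not model compression or any part of C-Coll, and the function names are hypothetical.

```python
def ring_allgather_rounds(num_procs):
    """Simulate a ring allgather and return the number of rounds it takes.

    Each process starts with its own chunk and, in every round, forwards
    the chunk it most recently obtained to its right-hand neighbour.
    """
    have = [{p} for p in range(num_procs)]  # chunks held by each process
    send = list(range(num_procs))           # chunk each process forwards next
    rounds = 0
    while any(len(h) < num_procs for h in have):
        # Process p receives whatever its left neighbour forwards.
        incoming = [send[(p - 1) % num_procs] for p in range(num_procs)]
        for p in range(num_procs):
            have[p].add(incoming[p])
        send = incoming
        rounds += 1
    return rounds


def binomial_bcast_rounds(num_procs):
    """Simulate a binomial tree broadcast from rank 0 and count rounds.

    In round k (starting at 0), every rank that already holds the data
    sends it to the rank 2**k positions above it, if that rank exists.
    """
    has_data = {0}
    step = 1
    rounds = 0
    while len(has_data) < num_procs:
        has_data |= {r + step for r in has_data if r + step < num_procs}
        step *= 2
        rounds += 1
    return rounds


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(n, ring_allgather_rounds(n), binomial_bcast_rounds(n))
```

For a power-of-two process count $N$, the simulation gives $N-1$ rounds for the ring and $\log_2 N$ rounds for the tree, matching the captions. For other values of $N$ the tree needs $\lceil \log_2 N \rceil$ rounds.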

Theorems & Definitions (6)

Theorem 1
proof
Corollary 1
Corollary 2
Theorem 2
proof
