HiCCL: A Hierarchical Collective Communication Library
Mert Hidayetoglu, Simon Garcia de Gonzalo, Elliott Slaughter, Pinku Surana, Wen-mei Hwu, William Gropp, Alex Aiken
TL;DR
HiCCL tackles the challenge of delivering high-throughput, portable collective communications on modern hierarchical GPU networks by introducing a machine-agnostic, compositional API built from multicast, reduction, and fence primitives. It automates performance portability through a five-parameter optimization space (hierarchy, per-level libraries, striping, ring size, and pipeline depth) and supports tree, ring, and hybrid tree-ring topologies with multi-NIC striping and pipelining. Empirical evaluation across four systems with Nvidia, AMD, and Intel GPUs shows a 17x geometric-mean throughput improvement over GPU-aware MPI implementations and throughput competitive with vendor libraries, with strong scaling up to hundreds of GPUs. The results demonstrate HiCCL’s potential to unify and accelerate portable, high-performance collectives in heterogeneous HPC environments.
Abstract
HiCCL (Hierarchical Collective Communication Library) addresses the growing complexity and diversity in high-performance network architectures. As GPU systems have evolved into networks of GPUs with different multilevel communication hierarchies, optimizing each collective function for a specific system has become a challenging task. Consequently, many collective libraries struggle to adapt to different hardware and software, especially across systems from different vendors. HiCCL's library design decouples the collective communication logic from network-specific optimizations through a compositional API. The communication logic is composed using multicast, reduction, and fence primitives, which are then factorized for a specified network hierarchy using only point-to-point operations within a level. Finally, striping and pipelining optimizations are applied as specified to streamline execution. Performance evaluation of HiCCL across four different machines (two with Nvidia GPUs, one with AMD GPUs, and one with Intel GPUs) demonstrates an average 17$\times$ higher throughput than the collectives of highly specialized GPU-aware MPI implementations, and competitive throughput with those of vendor-specific libraries (NCCL, RCCL, and OneCCL), while providing portability across all four machines.
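
To make the compositional idea concrete, the following is a minimal sketch of how an all-reduce might be expressed as reductions, a fence, and multicasts, and then checked with a toy single-process interpreter. It is not HiCCL's API: the names `Multicast`, `Reduction`, `Fence`, `compose_all_reduce`, and `run` are hypothetical, the "ranks" are plain Python lists, and the hierarchical factorization, striping, and pipelining steps are not modeled at all.

```python
from dataclasses import dataclass


@dataclass
class Multicast:
    """Hypothetical primitive: copy elements [lo, hi) from src to every rank in dsts."""
    src: int
    dsts: list
    lo: int
    hi: int


@dataclass
class Reduction:
    """Hypothetical primitive: sum elements [lo, hi) over srcs and store on dst."""
    srcs: list
    dst: int
    lo: int
    hi: int


class Fence:
    """Hypothetical primitive: steps after the fence see the results of steps before it."""


def compose_all_reduce(num_ranks, count):
    """Describe all-reduce as reduce-scatter, fence, all-gather."""
    assert count % num_ranks == 0, "sketch assumes evenly divisible buffers"
    chunk = count // num_ranks
    ranks = list(range(num_ranks))
    steps = []
    # Reduce-scatter: rank r ends up owning the reduced r-th chunk.
    for r in ranks:
        steps.append(Reduction(ranks, r, r * chunk, (r + 1) * chunk))
    # The all-gather below depends on the reduced chunks.
    steps.append(Fence())
    # All-gather: each rank multicasts its reduced chunk to all ranks.
    for r in ranks:
        steps.append(Multicast(r, ranks, r * chunk, (r + 1) * chunk))
    return steps


def run(steps, sendbufs):
    """Toy interpreter: executes the description on in-memory lists, one per rank."""
    count = len(sendbufs[0])
    recvbufs = [[0] * count for _ in sendbufs]
    source = sendbufs  # before the first fence, primitives read the send buffers
    for step in steps:
        if isinstance(step, Fence):
            # After a fence, primitives read what earlier steps produced.
            source = [list(buf) for buf in recvbufs]
        elif isinstance(step, Reduction):
            for i in range(step.lo, step.hi):
                recvbufs[step.dst][i] = sum(source[s][i] for s in step.srcs)
        elif isinstance(step, Multicast):
            for d in step.dsts:
                recvbufs[d][step.lo:step.hi] = source[step.src][step.lo:step.hi]
    return recvbufs


if __name__ == "__main__":
    steps = compose_all_reduce(num_ranks=4, count=4)
    # Rank r contributes the value r + 1 in every element.
    result = run(steps, [[r + 1] * 4 for r in range(4)])
    print(result[0])
```

The point of the sketch is the separation the abstract describes: `compose_all_reduce` states only what data must move and combine, while how each step is carried out (here, trivial list copies) is decided elsewhere. In a real library that second stage is where the hierarchy, striping, and pipelining choices would apply.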
