HiCCL: A Hierarchical Collective Communication Library

Mert Hidayetoglu; Simon Garcia de Gonzalo; Elliott Slaughter; Pinku Surana; Wen-mei Hwu; William Gropp; Alex Aiken

TL;DR

HiCCL tackles the challenge of delivering high-throughput, portable collective communication on modern hierarchical GPU networks by introducing a machine-agnostic, compositional API built from multicast, reduction, and fence primitives. It exposes a five-parameter optimization space (hierarchy, per-level libraries, striping, ring size, and pipeline depth) for performance portability, and supports tree, ring, and hybrid ring+tree topologies with multi-NIC striping and pipelining. Empirical evaluation across four systems with Nvidia, AMD, and Intel GPUs shows a 17x geometric-mean throughput improvement over GPU-aware MPI implementations and competitive results versus vendor libraries, with strong scaling up to hundreds of GPUs. The results demonstrate HiCCL's potential to unify and accelerate portable, high-performance collectives in heterogeneous HPC environments.

Abstract

HiCCL (Hierarchical Collective Communication Library) addresses the growing complexity and diversity in high-performance network architectures. As GPU systems have evolved into networks of GPUs with different multilevel communication hierarchies, optimizing each collective function for a specific system has become a challenging task. Consequently, many collective libraries struggle to adapt to different hardware and software, especially across systems from different vendors. HiCCL's library design decouples the collective communication logic from network-specific optimizations through a compositional API. The communication logic is composed using multicast, reduction, and fence primitives, which are then factorized for a specified network hierarchy using only point-to-point operations within a level. Finally, striping and pipelining optimizations are applied as specified to streamline the execution. Performance evaluation of HiCCL across four different machines (two with Nvidia GPUs, one with AMD GPUs, and one with Intel GPUs) demonstrates an average 17$\times$ higher throughput than the collectives of highly specialized GPU-aware MPI implementations, and competitive throughput with those of vendor-specific libraries (NCCL, RCCL, and OneCCL), while providing portability across all four machines.
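
To make the compositional idea concrete, the following is a minimal single-process sketch of how an All-Reduce might be assembled from reduction, fence, and multicast primitives, following the composition described in the Figure 4 caption (Reduce-Scatter, then a fence, then an in-place All-Gather). The class and method names (`Schedule`, `add_reduction`, `add_multicast`, `add_fence`) are hypothetical and chosen for illustration only; they are not HiCCL's actual API, and the toy model ignores hierarchy, striping, and pipelining.

```python
# Toy model of composing a collective from primitives.
# Schedule, add_reduction, add_multicast, and add_fence are hypothetical
# names for illustration; they are NOT HiCCL's actual API.

class Schedule:
    def __init__(self):
        # A list of steps; each fence closes the current step and opens a new one.
        self.steps = [[]]

    def add_reduction(self, src_ranks, root, chunk):
        # Sum chunk `chunk` from every rank in src_ranks into rank `root`.
        self.steps[-1].append(("reduce", tuple(src_ranks), root, chunk))

    def add_multicast(self, root, dst_ranks, chunk):
        # Copy chunk `chunk` from rank `root` to every rank in dst_ranks.
        self.steps[-1].append(("multicast", root, tuple(dst_ranks), chunk))

    def add_fence(self):
        # Primitives registered after the fence run only after earlier ones finish.
        self.steps.append([])

    def run(self, send, recv):
        # send[r][c] is input chunk c on rank r; recv[r][c] is the output chunk.
        for step in self.steps:
            for op in step:
                if op[0] == "reduce":
                    _, srcs, root, c = op
                    recv[root][c] = sum(send[r][c] for r in srcs)
                else:
                    _, root, dsts, c = op
                    for r in dsts:
                        recv[r][c] = recv[root][c]


def all_reduce(p):
    s = Schedule()
    for i in range(p):  # Reduce-Scatter: rank i owns the reduced chunk i
        s.add_reduction(range(p), root=i, chunk=i)
    s.add_fence()
    for i in range(p):  # All-Gather, in place: the root does not send to itself
        others = [r for r in range(p) if r != i]
        s.add_multicast(root=i, dst_ranks=others, chunk=i)
    return s


p = 3
send = [[(r + 1) * 10 + c for c in range(p)] for r in range(p)]
recv = [[None] * p for _ in range(p)]
schedule = all_reduce(p)
schedule.run(send, recv)
print(recv)
```

With three processes this registers three reductions, one fence, and three multicasts, and every rank ends with the same element-wise sum of all inputs.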

Paper Structure (33 sections, 2 equations, 10 figures, 5 tables)

Introduction
Background
Conventional Libraries and Collective Functions
Hierarchical Communication
Communications Across Multi-NIC Nodes
Composition of Collectives
Collective Primitives
Single-Step Collectives
Multi-Step Collectives
Optimizations
Optimization Space
Hierarchical Tree Structure
Multi-NIC Striping
Hybrid Ring+Tree Topology
Pipelining
...and 18 more sections

Figures (10)

Figure 1: Broadcasting $d$ bytes across six GPUs (a) directly and (b) hierarchically. Each black dot corresponds to a GPU endpoint. Each set of three GPUs corresponds to a compute node. (a) The direct implementation redundantly moves three copies of the data across nodes (blue). (b) The hierarchical optimization moves a single copy across nodes (blue) and distributes additional copies within nodes (maroon).
Figure 2: Various associations across $g$ GPUs and $k$ NICs per node ($k\le g$). In our test systems, each GPU is logically bound to a single NIC via (a) packed, (b) round-robin, or (c) bijective associations.
Figure 3: The (a) multicast and (b) reduction primitives form a simple tree structure with one root and multiple leaves.
Figure 4: Composition of the (c) All-Reduce function as (a) a Reduce-Scatter followed by (b) an All-Gather on three processes. The registration takes three reduction primitives, followed by a fence, followed by three multicast primitives. The dashed edges on the broadcasts can be omitted for an in-place implementation.
Figure 5: Various tree structures and their notations across 24 GPUs. The examples show (a)--(b) two, (c)--(d) three, and (e)--(f) four levels of hierarchy. The colors represent different communication links across levels: level 1 (red), level 2 (yellow), level 3 (green), and leaf (blue). HiCCL implements each level with the chosen communication library.
...and 5 more figures
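
The traffic argument in the Figure 1 caption can be checked with a short sketch that enumerates the point-to-point transfers of a direct and a hierarchical broadcast and counts the bytes crossing node boundaries. The function names, the choice of GPU 0 as root, and the use of the first GPU of each node as its leader are assumptions made for illustration; they are not taken from the paper.

```python
# Count inter-node vs. intra-node bytes for a broadcast of d bytes.
# Assumptions (not from the paper): GPU 0 is the root, GPUs are numbered
# node by node, and the first GPU of each node acts as that node's leader.

def broadcast_edges(num_nodes, gpus_per_node, hierarchical):
    """Return the list of (sender, receiver) GPU pairs."""
    total = num_nodes * gpus_per_node
    if not hierarchical:
        # Direct: the root sends one copy to every other GPU.
        return [(0, g) for g in range(1, total)]
    leaders = [n * gpus_per_node for n in range(num_nodes)]
    # One copy per remote node, sent to that node's leader.
    edges = [(0, leader) for leader in leaders if leader != 0]
    # Each leader then distributes copies to the other GPUs on its node.
    for leader in leaders:
        edges += [(leader, leader + i) for i in range(1, gpus_per_node)]
    return edges


def traffic(edges, gpus_per_node, d):
    """Return (inter_node_bytes, intra_node_bytes) for the given transfers."""
    inter = sum(d for s, t in edges if s // gpus_per_node != t // gpus_per_node)
    intra = sum(d for s, t in edges if s // gpus_per_node == t // gpus_per_node)
    return inter, intra


d = 1  # message size in bytes; results scale linearly with d
direct = traffic(broadcast_edges(2, 3, hierarchical=False), 3, d)
hier = traffic(broadcast_edges(2, 3, hierarchical=True), 3, d)
print("direct (inter, intra):", direct)
print("hierarchical (inter, intra):", hier)
```

For the six-GPU, two-node case in Figure 1, the direct scheme moves 3d bytes across nodes while the hierarchical scheme moves d, shifting the remaining copies onto intra-node links.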
