A HPX Communication Benchmark: Distributed FFT using Collectives

Alexander Strack, Dirk Pflüger

TL;DR

This work analyzes explicit collective communication within HPX by benchmarking three parcelport backends (TCP, MPI, LCI) on a distributed FFT workload and comparing against FFTW3's MPI+X implementation. The distributed 2D FFT relies on collectives, with the authors proposing an N-scatter alternative to all-to-all to increase asynchrony. Results show the LCI parcelport delivering the best performance for both scatter and all-to-all patterns on a 16-node cluster, beating FFTW3 MPI+X by up to 3x, while TCP incurs higher overhead for small chunks. The study highlights the potential of HPX parcelports to provide flexible, high-performance communication abstractions for distributed HPC applications and points to future work adding more parcelport options and accompanying materials.

Abstract

Due to increasing core counts in modern processors, several task-based runtimes have emerged, including the C++ Standard Library for Concurrency and Parallelism (HPX). Although the asynchronous many-task runtime HPX allows implicit communication via an Active Global Address Space, it also supports explicit collective operations. Collectives are an efficient way to realize complex communication patterns. In this work, we benchmark the TCP, MPI, and LCI communication backends of HPX, which are called parcelports in HPX terms. We use a distributed multi-dimensional FFT application relying on collectives. Furthermore, we compare the performance of the HPX all-to-all and scatter collectives against an FFTW3 reference based on MPI+X on a 16-node cluster. Of the three parcelports, LCI performed best for both scatter and all-to-all collectives. In addition, the LCI parcelport was up to a factor of 3 faster than the MPI+X reference. Our results highlight the potential of message abstractions and the parcelports of HPX.

Paper Structure

This paper contains 5 sections, 5 figures.

Figures (5)

  • Figure 1: Four steps that need to be executed in sequence for each dimension of a two-dimensional FFT on two separated memory domains.
  • Figure 2: Hardware specification of benchmark cluster.
  • Figure 3: Chunk size scaling on two nodes.
  • Figure 4: Strong scaling on up to 16 nodes for HPX all-to-all collective.
  • Figure 5: Strong scaling on up to 16 nodes for HPX scatter collective.
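
The all-to-all and scatter collectives compared in Figures 4 and 5 move the same data between nodes. The following is a minimal sequential sketch of that equivalence, not the authors' code and not the HPX API: it simulates N ranks in one process and shows that one all-to-all produces the same receive buffers as N scatters, each rooted at a different rank. All names (`Chunk`, `Buffers`, `all_to_all`, `scatter`, `n_scatter`, `make_example`) are hypothetical and chosen for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One contiguous chunk of data destined for a single rank.
using Chunk = std::vector<double>;
// Buffers[r][d]: on the send side, the chunk rank r sends to rank d;
// on the receive side, the chunk rank r received from rank d.
using Buffers = std::vector<std::vector<Chunk>>;

// Simulated all-to-all: every rank sends one chunk to every rank
// in a single collective step.
Buffers all_to_all(const Buffers& send) {
    const std::size_t n = send.size();
    Buffers recv(n, std::vector<Chunk>(n));
    for (std::size_t r = 0; r < n; ++r) {
        assert(send[r].size() == n);  // one chunk per destination
        for (std::size_t d = 0; d < n; ++d) {
            recv[d][r] = send[r][d];
        }
    }
    return recv;
}

// Simulated scatter rooted at `root`: the root hands chunk d to rank d.
void scatter(const Buffers& send, std::size_t root, Buffers& recv) {
    assert(send[root].size() == recv.size());
    for (std::size_t d = 0; d < recv.size(); ++d) {
        recv[d][root] = send[root][d];
    }
}

// N scatters, one per root. Each call touches a disjoint part of `recv`,
// so the calls do not depend on each other (assumption of this sketch:
// a real runtime could let them complete independently).
Buffers n_scatter(const Buffers& send) {
    const std::size_t n = send.size();
    Buffers recv(n, std::vector<Chunk>(n));
    for (std::size_t root = 0; root < n; ++root) {
        scatter(send, root, recv);
    }
    return recv;
}

// Hypothetical test data: element k of the chunk from rank r to rank d
// is encoded as r * 100 + d * 10 + k.
Buffers make_example(std::size_t n, std::size_t chunk_len) {
    Buffers send(n, std::vector<Chunk>(n, Chunk(chunk_len)));
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t d = 0; d < n; ++d)
            for (std::size_t k = 0; k < chunk_len; ++k)
                send[r][d][k] = static_cast<double>(r * 100 + d * 10 + k);
    return send;
}
```

Calling `all_to_all(send)` and `n_scatter(send)` on the same input yields identical buffers. The sketch only illustrates data movement; it says nothing about the performance differences the benchmark measures, which depend on the parcelport and the chunk size.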