Scalability of 3D-DFT by block tensor-matrix multiplication on the JUWELS Cluster

Nitin Malapally, Viacheslav Bolnykh, Estela Suarez, Paolo Carloni, Thomas Lippert, Davide Mandelli

TL;DR

This work addresses the poor scaling of the 3D-DFT on distributed-memory (DM) systems by proposing a block tensor-matrix multiplication (BTMM) approach that replaces all-to-all communication with point-to-point communication in adapted variants of Cannon's algorithm. Implemented as the S3DFT C++ library, the method rewrites the 3D-DFT as a sequence of tensor-matrix multiplications on a volumetric decomposition over $p^3$ PEs, with a permutation-based assembly of results. The shared-memory tensor-matrix multiplication reaches up to $88\%$ of the theoretical single-node peak. Of the three DM variants, one scales well on JUWELS, while the other two perform more poorly because of their intrinsic communication patterns. In benchmark comparisons, iMKL is the fastest overall, followed by FFTW3 and then S3DFT. The results illustrate both the potential of BTMM-based 3D-DFT on modern HPC hardware and the need for improved DM communication strategies before it can compete with FFT-based approaches on large-scale systems; the ranking might change on networks with higher latency, e.g. on cloud platforms.

Abstract

The 3D Discrete Fourier Transform (DFT) is a technique used to solve problems in disparate fields. Nowadays, the commonly adopted implementation of the 3D-DFT is derived from the Fast Fourier Transform (FFT) algorithm. However, evidence indicates that the distributed-memory 3D-FFT algorithm does not scale well due to its use of all-to-all communication. Here, building on the work of Sedukhin et al. [Proceedings of the 30th International Conference on Computers and Their Applications, CATA 2015, pp. 193-200 (01 2015)], we revisit the possibility of improving the scaling of the 3D-DFT by using an alternative approach that uses point-to-point communication, albeit at a higher arithmetic complexity. The new algorithm exploits tensor-matrix multiplications on a volumetrically decomposed domain via three specially adapted variants of Cannon's algorithm. It is implemented here as a C++ library called S3DFT and tested on the JUWELS Cluster at the Jülich Supercomputing Centre. Our implementation of the shared-memory tensor-matrix multiplication attained 88% of the theoretical single-node peak performance. One variant of the distributed-memory tensor-matrix multiplication shows excellent scaling, while the other two show poorer performance, which can be attributed to their intrinsic communication patterns. A comparison of S3DFT with the Intel MKL and FFTW3 libraries indicates that iMKL currently performs best overall, followed in order by FFTW3 and S3DFT. This picture might change with further improvements of the algorithm and/or when running on clusters that use network connections with higher latency, e.g. on cloud platforms.
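
To make the reformulation concrete, the following is a minimal serial sketch of a 3D-DFT computed as three successive tensor-matrix multiplications with the DFT matrix, checked against a direct evaluation of the triple sum. It is an illustration under stated assumptions and not the S3DFT implementation: the function names (`dft_matrix`, `mode_multiply`, `dft3_by_tmm`, `dft3_direct`) are hypothetical, the tensor is cubic and stored as nested lists, the forward transform is assumed to use the unnormalized kernel $e^{-2\pi i jk/N}$, and no blocking, volumetric decomposition, or communication is modelled.

```python
import cmath


def dft_matrix(n):
    """Return the n x n DFT matrix F with F[k][j] = exp(-2*pi*i*j*k/n)."""
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n)]
            for k in range(n)]


def mode_multiply(x, f, axis):
    """Multiply the cubic tensor x by the matrix f along one axis.

    For axis 0 this computes y[k][b][c] = sum_m f[k][m] * x[m][b][c],
    and analogously for axes 1 and 2.
    """
    n = len(x)
    y = [[[0j] * n for _ in range(n)] for _ in range(n)]
    for a in range(n):
        for b in range(n):
            for c in range(n):
                out = (a, b, c)
                acc = 0j
                for m in range(n):
                    src = list(out)
                    src[axis] = m  # summation index replaces the transformed axis
                    acc += f[out[axis]][m] * x[src[0]][src[1]][src[2]]
                y[a][b][c] = acc
    return y


def dft3_by_tmm(x):
    """3D-DFT as three sequential tensor-matrix multiplications."""
    f = dft_matrix(len(x))
    for axis in range(3):
        x = mode_multiply(x, f, axis)
    return x


def dft3_direct(x):
    """Reference: evaluate the triple sum of the 3D-DFT definition directly."""
    n = len(x)
    w = [cmath.exp(-2j * cmath.pi * t / n) for t in range(n)]
    y = [[[0j] * n for _ in range(n)] for _ in range(n)]
    for k1 in range(n):
        for k2 in range(n):
            for k3 in range(n):
                acc = 0j
                for n1 in range(n):
                    for n2 in range(n):
                        for n3 in range(n):
                            phase = (k1 * n1 + k2 * n2 + k3 * n3) % n
                            acc += x[n1][n2][n3] * w[phase]
                y[k1][k2][k3] = acc
    return y
```

Each call to `mode_multiply` performs $N^4$ complex multiply-adds for an $N^3$ tensor, so the three-step product costs $O(N^4)$ operations compared with $O(N^3 \log N)$ for an FFT. This is the higher arithmetic complexity that the approach accepts in exchange for a communication pattern that can be organized as point-to-point exchanges.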

Paper Structure (12 sections, 8 equations, 8 figures, 2 tables, 5 algorithms)

Figures (8)

  • Figure 1: Visualization of the procedure to compute the tensor-matrix multiplication as a set of independent matrix-matrix multiplications. Left and right panels are for $\rho_R$ and $\rho_L$, respectively.
  • Figure 2: Volumetric decomposition: a tensor and a matrix are broken down into $p^3$ and $p^2$ blocks, respectively (here, $p=3$). Each block is allocated locally by its corresponding PE, as indicated in the circles.
  • Figure 3: Comparison of the effective bandwidths achieved by the DAXPY kernel and by the naïve and optimized versions of the transpose function. Left and right panels report results obtained in single- and dual-NUMA-domain configurations, respectively.
  • Figure 4: The left panel shows the performance of the shared-memory tensor-matrix multiplication as a function of block size; the right panel shows its strong-scaling behaviour. Grey and blue curves are for single (24 cores) and dual (48 cores) NUMA configurations, respectively. The corresponding problem sizes are $N=900$ and $N=1100$. The red line in the right panel indicates the peak performance of the single node. The grey dashed line indicates ideal linear scaling.
  • Figure 5: Fitted curves showing the duration of the communication event (continuous lines, different node counts) and that of the local update (dotted line) as functions of block size. The black dots indicate the intersections at which perfect overlapping can be expected. Left and right panels report results obtained in 2 MPI tasks/node and 1 MPI task/node configurations, respectively.
  • ...and 3 more figures