
Slide FFT on a homogeneous mesh in wafer-scale computing

Maurice H. P. M. van Putten, Leighton Wilson, Adam W. Lavely, Mark Hair

TL;DR

This work addresses the memory bottleneck of FFT on conventional architectures by leveraging the CS-2 wafer-scale engine, a homogeneous PE mesh with fast on-chip interconnect, and introducing a synchronous Slide operation to enable compute-limited FFTs for arbitrary transform sizes. The approach reduces non-local data movement, yields near-linear scaling of slide performance with data size, and demonstrates favorable efficiency independent of the transform’s parallelization length. Preliminary CS-2 benchmarks show promising linear scaling and stable per-element costs, suggesting substantial acceleration potential for FFT-based signal processing in applications such as gravitational-wave searches and multi-messenger astronomy. The results indicate a viable path toward high-throughput, low-latency FFT on wafer-scale architectures, with future work focusing on batch processing and host-memory management.

Abstract

Searches for signals at low signal-to-noise ratios frequently involve the Fast Fourier Transform (FFT). For high-throughput searches, we here consider FFT on the homogeneous mesh of Processing Elements (PEs) of a wafer-scale engine (WSE). To minimize memory overhead in the inherently non-local FFT algorithm, we introduce a new synchronous slide operation (Slide) exploiting the fast interconnect between adjacent PEs. Feasibility of compute-limited performance is demonstrated in linear scaling of Slide execution times with varying array size in preliminary benchmarks on the CS-2 WSE. The proposed implementation appears opportune to accelerate and open the full discovery potential of FFT-based signal processing in multi-messenger astronomy.

Paper Structure (7 sections, 13 equations, 6 figures)

This paper contains 7 sections, 13 equations, 6 figures.

Figures (6)

  • Figure 1: FFT performance on a GPU (Radeon VII with HBM2) by efficiency (left scale) and throughput (right scale) in matched filtering, expressed by the product of transform size $N=2^m$ and FFT transform rate. Results are computed with clFFT (brag15) in complex single-precision and out-of-place as a function of transform size $N$ in batch mode, shown across memory allocations $M=64$ MB and $M=1024$ MB in Global Memory. Efficiency is limited by memory bandwidth, especially when transform sizes exceed the size of Local Memory, noticeably across $N=2^{12}$, leaving about 8% of theoretical peak compute performance in f32. (After Fig. D.1 of van23a.)
  • Figure 2: (Left panel.) Following index permutation of input data, FFT takes a path in reverse over $m=\log_2 n$ levels $p=m,m-1,\cdots,1$. At each $p$, adjacent segments $E$, $O$ are concatenated following a rotation (\ref{EQN_LR}), amenable to embarrassingly parallel computing of $n/2^p$ crossings. (Right panel.) Crossing diagram of rotations (\ref{EQN_EO}) of adjacent segments showing (\ref{EQN_LR}) (wiggly line), subsequent to alignment of $E$ and $O$ in shared memory, followed by sliding output $L,R$ back as input to level $p-1$.
  • Figure 3: Memory overhead in aligning $E$ and $O$ in (\ref{EQN_LR}) at level $p$. From top to bottom: coherent shifts to a shared memory location, $E\times O\rightarrow LR$, followed by coherent shifts back to the original location of $E,O$. The concatenated segment (white) is $E$ or $O$ at level $p-1$.
  • Figure 4: Architecture of the wafer-scale engine CS-2 comprising a homogeneous grid of 850,000 PEs (left). Each PE consists of a Compute Element (CE) supported by 48 kB of SRAM Local Memory. Local Memory is connected over a bus to a router enabling bi-directional communication to its nearest neighbors (right).
  • Figure 5: Performance of Slide, Eq. (\ref{EQN_s}), in cycles per array element versus total number of elements in individual benchmarks on the CS-2 (left top) versus simulated in the Cerebras SDK (right top). Data are distributed evenly over 8, 16, and 32 PEs, the number of elements per PE varying from 1 to 500. (Lower panel.) Results for maximal data arrays over $4k$ PEs ($k=2,3,\cdots,8$) show performance to be essentially invariant to the number of PEs used, with regular convergence to about 1.3 cycles per data element.
  • ...and 1 more figures
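
The level structure described in the Figure 2 caption (index permutation of the input, then $m=\log_2 n$ levels in which adjacent segments $E$ and $O$ are combined into a concatenated segment $L,R$) resembles the textbook iterative radix-2 decimation-in-time FFT. The following is a minimal Python sketch of that textbook algorithm, not the paper's CS-2 implementation: it runs on a single array and does not model Slide, the PE mesh, or any memory alignment. The function names (`bit_reverse`, `fft_levels`, `naive_dft`), the forward sign convention $e^{-2\pi i jk/n}$, and the reading that segments at level $p$ have length $n/2^p$ are assumptions made for illustration.

```python
import cmath


def bit_reverse(i, m):
    """Reverse the lowest m bits of the integer i."""
    r = 0
    for _ in range(m):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_levels(x):
    """Iterative radix-2 FFT over levels p = m, m-1, ..., 1 (hypothetical helper).

    Assumption: at level p the segments E and O each hold n / 2**p elements,
    so level p = m combines single elements and level p = 1 yields the result.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    m = n.bit_length() - 1

    # Index permutation of the input data (bit reversal).
    a = [0j] * n
    for i, v in enumerate(x):
        a[bit_reverse(i, m)] = complex(v)

    for p in range(m, 0, -1):
        half = n >> p  # length of each segment E and O at this level
        for start in range(0, n, 2 * half):  # independent pairs of segments
            for k in range(half):
                # Rotate O by a twiddle factor, then combine with E.
                w = cmath.exp(-2j * cmath.pi * k / (2 * half))
                e = a[start + k]
                o = w * a[start + half + k]
                a[start + k] = e + o          # left half L of the new segment
                a[start + half + k] = e - o   # right half R of the new segment
    return a


def naive_dft(x):
    """O(n^2) reference DFT with the same sign convention, used as a check."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


if __name__ == "__main__":
    data = [1, 2, 3, 4, 0, -1, 2.5, 7]
    fast = fft_levels(data)
    slow = naive_dft(data)
    print(max(abs(f - s) for f, s in zip(fast, slow)) < 1e-9)
```

In this sketch the pairs of segments at a given level do not depend on one another, which is the property the caption refers to as embarrassingly parallel; how those pairs are mapped to PEs and aligned is specific to the paper and is not represented here.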