Slide FFT on a homogeneous mesh in wafer-scale computing
Maurice H. P. M. van Putten, Leighton Wilson, Adam W. Lavely, Mark Hair
TL;DR
This work addresses the memory bottleneck of the FFT on conventional architectures by using the CS-2 wafer-scale engine, a homogeneous processing-element (PE) mesh with a fast on-chip interconnect, and by introducing a synchronous Slide operation that enables compute-limited FFTs for arbitrary transform sizes. The approach reduces non-local data movement, gives near-linear scaling of Slide execution time with data size, and shows favorable efficiency independent of the transform’s parallelization length. Preliminary CS-2 benchmarks show linear scaling and stable per-element costs, suggesting substantial acceleration potential for FFT-based signal processing in applications such as gravitational-wave searches and multi-messenger astronomy. The results indicate a viable path toward high-throughput, low-latency FFT on wafer-scale architectures, with future work focusing on batch processing and host-memory management.
Abstract
Searches for signals at low signal-to-noise ratios frequently rely on the Fast Fourier Transform (FFT). For high-throughput searches, we consider the FFT on the homogeneous mesh of Processing Elements (PEs) of a wafer-scale engine (WSE). To minimize memory overhead in the inherently non-local FFT algorithm, we introduce a new synchronous slide operation (Slide) that exploits the fast interconnect between adjacent PEs. The feasibility of compute-limited performance is demonstrated by the linear scaling of Slide execution times with array size in preliminary benchmarks on the CS-2 WSE. The proposed implementation appears well suited to accelerate FFT-based signal processing in multi-messenger astronomy and to open up its full discovery potential.
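
To make the idea of a neighbor-only data exchange concrete, the following toy model may help. It is a sketch under stated assumptions, not the authors' implementation: it assumes a one-dimensional ring of PEs holding one complex value each, models a slide as lockstep single-hop shifts, and uses a textbook radix-2 decimation-in-frequency FFT. The names `slide` and `fft_with_slides` are hypothetical and do not correspond to any CS-2 API.

```python
import cmath


def slide(values, distance):
    """Toy slide: shift every value `distance` PEs along a ring.

    Assumption: each step moves all values one hop to an adjacent PE in
    lockstep. Returns the shifted list and the number of hops taken.
    """
    n = len(values)
    step = 1 if distance > 0 else -1
    hops = 0
    for _ in range(abs(distance)):
        # After one step, PE i holds what its neighbor i - step held.
        values = [values[(i - step) % n] for i in range(n)]
        hops += 1
    return values, hops


def fft_with_slides(x):
    """Radix-2 decimation-in-frequency FFT with one value per simulated PE.

    Partner values are fetched only through `slide`, so the hop count gives
    a rough measure of neighbor-to-neighbor traffic in this toy model.
    """
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    data = [complex(v) for v in x]
    total_hops = 0
    half = n // 2
    while half >= 1:
        from_right, h1 = slide(data, -half)  # from_right[i] == data[i + half]
        from_left, h2 = slide(data, half)    # from_left[i] == data[i - half]
        # Assumption: both directions move concurrently, so count them once.
        total_hops += max(h1, h2)
        new = []
        for i in range(n):
            if i & half == 0:
                # Lower butterfly output: sum of the pair.
                new.append(data[i] + from_right[i])
            else:
                # Upper butterfly output: difference times a twiddle factor.
                j = i % half
                w = cmath.exp(-2j * cmath.pi * j / (2 * half))
                new.append((from_left[i] - data[i]) * w)
        data = new
        half //= 2
    # Decimation in frequency leaves the spectrum in bit-reversed order.
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        r = int(format(i, "b").zfill(bits)[::-1], 2) if bits else 0
        out[r] = data[i]
    return out, total_hops
```

In this model the butterfly distances are N/2, N/4, ..., 1, so a transform of length N needs N - 1 hops in total, which grows linearly with N. This only illustrates why neighbor-only movement can scale linearly under the stated assumptions; the measured CS-2 behavior is what the benchmarks in the paper report.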
