TurboFFT: A High-Performance Fast Fourier Transform with Fault Tolerance on GPU
Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Huangliang Dai, Sheng Di, Zizhong Chen, Franck Cappello
TL;DR
TurboFFT addresses silent data corruption in GPU FFTs with a two-sided ABFT scheme that detects errors on the fly and corrects them later in batches, avoiding recomputation. It combines architecture-aware optimizations (tiling, workload balancing, twiddle-factor handling, memory access patterns) with kernel fusion and a template-based code generation framework that supports a broad range of input sizes and data types. The key contributions are the two-sided ABFT design at the thread and threadblock levels, its fusion into the FFT kernel, and the automatic generation of optimized kernels. All are validated on NVIDIA A100 and T4 GPUs, where fault-tolerance overhead remains modest (roughly 7–15% relative to cuFFT), including under fault injection. The results show that TurboFFT matches or exceeds cuFFT performance while protecting against faults, which makes it practical for exascale and safety-critical GPU workloads where silent data corruption is a concern.
Abstract
The Fast Fourier Transform (FFT), a core computation in a wide range of scientific applications, is increasingly threatened by reliability issues. In this paper, we introduce TurboFFT, a high-performance FFT implementation equipped with a two-sided checksum scheme that efficiently detects and corrects silent data corruptions at computing units. The proposed two-sided checksum addresses the error propagation issue by encoding a batch of input signals with different linear combinations, which not only allows fast batched error detection but also enables on-the-fly error correction instead of recomputation. We explore two-sided checksum designs at the kernel, thread, and threadblock levels, and provide a baseline FFT implementation competitive with the state-of-the-art, closed-source cuFFT. We demonstrate a kernel fusion strategy that mitigates the computation and memory overhead introduced by fault tolerance and overlaps it with the underlying FFT computation. We present a template-based code generation strategy to reduce development costs and support a wide range of input sizes and data types. Experimental results on an NVIDIA A100 server GPU and a Tesla Turing T4 GPU demonstrate that TurboFFT offers competitive or superior performance compared to the closed-source library cuFFT. TurboFFT incurs only minimal overhead (7% to 15% on average) compared to cuFFT, even under hundreds of error injections per minute, for both single and double precision. TurboFFT achieves a 23% improvement over existing fault-tolerant FFT schemes.
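The batch-encoding idea in the abstract can be illustrated with a small sketch. It is not TurboFFT's GPU kernel; it is a pure-Python toy that relies only on the linearity of the DFT, and it makes several assumptions that do not come from the paper: the function names (`dft`, `encode`, `detect_and_correct`), the choice of weights 1 and b+1 for the two linear combinations, the tolerance, a single faulty signal per batch, and fault-free transforms of the checksum vectors themselves. Because a fault inside an FFT typically spreads across the whole output of that signal, the sketch corrects the entire output vector from the checksum difference rather than a single element.

```python
import cmath

def dft(x):
    """Naive O(N^2) DFT, standing in for an optimized FFT kernel."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def encode(batch):
    """Two linear combinations of the input batch (weights are an assumption)."""
    n = len(batch[0])
    c1 = [sum(sig[k] for sig in batch) for k in range(n)]                       # weight 1
    c2 = [sum((b + 1) * sig[k] for b, sig in enumerate(batch)) for k in range(n)]  # weight b+1
    return c1, c2

def detect_and_correct(outputs, c1_out, c2_out, tol=1e-6):
    """Compare transformed checksums with the combined outputs.

    Returns the index of the corrected signal, or None if no error is found.
    Assumes at most one signal in the batch is corrupted.
    """
    n = len(outputs[0])
    d1 = [c1_out[k] - sum(y[k] for y in outputs) for k in range(n)]
    d2 = [c2_out[k] - sum((b + 1) * y[k] for b, y in enumerate(outputs))
          for k in range(n)]
    k_max = max(range(n), key=lambda k: abs(d1[k]))
    if abs(d1[k_max]) <= tol:
        return None  # checksums agree: no detectable error
    # If signal j carries error e, then d1 = -e and d2 = -(j + 1) * e,
    # so the ratio d2 / d1 locates the faulty signal.
    faulty = round((d2[k_max] / d1[k_max]).real) - 1
    # Adding d1 removes the error from the whole output vector, no recomputation.
    outputs[faulty] = [y + d for y, d in zip(outputs[faulty], d1)]
    return faulty

# Demo: a batch of 4 signals of length 8, with a simulated fault in signal 2.
batch = [[complex(b + k, b - k) for k in range(8)] for b in range(4)]
c1, c2 = encode(batch)
reference = [dft(sig) for sig in batch]
outputs = [list(y) for y in reference]
outputs[2] = [y + complex(0.5 * (k + 1), -0.25) for k, y in enumerate(outputs[2])]
faulty = detect_and_correct(outputs, dft(c1), dft(c2))
max_err = max(abs(a - b) for ys, rs in zip(outputs, reference) for a, b in zip(ys, rs))
```

In this toy setup, protecting a batch of B signals costs two extra transforms, so the relative overhead shrinks as the batch grows; how the actual implementation fuses the checksum work into the FFT kernel is described in the paper itself.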
