Computing FFTs at Target Precision Using Lower-Precision FFTs

Shota Kawakami; Daisuke Takahashi

Computing FFTs at Target Precision Using Lower-Precision FFTs

Shota Kawakami, Daisuke Takahashi

Abstract

Modern processors deliver higher throughput for lower-precision arithmetic than for higher-precision arithmetic. For matrix multiplication, the Ozaki scheme exploits this performance gap by splitting the inputs into lower-precision components and delegating the computation to optimized lower-precision routines. However, no similar approach exists for the fast Fourier transform (FFT). Here, we propose a method that computes target-precision FFTs using lower-precision FFTs by applying the Ozaki scheme to the cyclic convolution in the Bluestein FFT. The split component convolutions are computed exactly using the number theoretic transform (NTT), an FFT over a finite field, instead of floating-point FFTs, combined with the Chinese remainder theorem. We introduce an upper bound on the number of splits and an NTT-domain accumulation strategy to reduce the NTT call count. As a concrete implementation, we implement a double-precision FFT using 32-bit NTTs and confirm reduced relative error compared with those for FFTs based on FFTW and Triple-Single precision arithmetic, with stable error across FFT lengths, at most 96 NTT calls, or 64 NTT calls with NTT-domain accumulation. On an Intel Xeon Platinum 8468 for lengths $n=2^{10}$-$2^{18}$, the execution time is approximately 107-1315$\times$ that of FFTW's double-precision FFT, with NTTs accounting for approximately 80% of the total time.

Computing FFTs at Target Precision Using Lower-Precision FFTs

Abstract

Computing FFTs at Target Precision Using Lower-Precision FFTs

Abstract

Paper Structure

Table of Contents

Figures (13)