Computing FFTs at Target Precision Using Lower-Precision FFTs

Shota Kawakami, Daisuke Takahashi

Abstract

Modern processors deliver higher throughput for lower-precision arithmetic than for higher-precision arithmetic. For matrix multiplication, the Ozaki scheme exploits this performance gap by splitting the inputs into lower-precision components and delegating the computation to optimized lower-precision routines. However, no similar approach exists for the fast Fourier transform (FFT). Here, we propose a method that computes target-precision FFTs using lower-precision FFTs by applying the Ozaki scheme to the cyclic convolution in the Bluestein FFT. The convolutions of the split components are computed exactly using the number-theoretic transform (NTT), an FFT over a finite field, combined with the Chinese remainder theorem, instead of floating-point FFTs. We introduce an upper bound on the number of splits and an NTT-domain accumulation strategy to reduce the NTT call count. As a concrete instance, we implement a double-precision FFT using 32-bit NTTs and confirm that its relative error is lower than that of FFTs based on FFTW and on Triple-Single precision arithmetic. The error is stable across FFT lengths, and the method requires at most 96 NTT calls, or 64 NTT calls with NTT-domain accumulation. On an Intel Xeon Platinum 8468, for lengths $n=2^{10}$ to $2^{18}$, the execution time is approximately 107$\times$ to 1315$\times$ that of FFTW's double-precision FFT, with NTTs accounting for approximately 80% of the total time.
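
The exact-convolution building block mentioned in the abstract can be illustrated with a minimal Python sketch. It computes an integer cyclic convolution exactly by running an NTT modulo two primes and recombining the residues with the Chinese remainder theorem. The function names, the two primes, and the test vectors are illustrative assumptions, not the paper's choices; the sketch omits the Ozaki-style splitting, the Bluestein chirp, and NTT-domain accumulation.

```python
# Two NTT-friendly primes below 2**32 (illustrative, not taken from the paper).
# 3 is a primitive root modulo both.
P1, P2, G = 998244353, 469762049, 3


def ntt(a, p, g, invert=False):
    """Radix-2 NTT over Z/pZ. len(a) must be a power of two dividing p - 1."""
    a = list(a)
    n = len(a)
    j = 0
    for i in range(1, n):  # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:  # butterfly stages
        w_len = pow(g, (p - 1) // length, p)
        if invert:
            w_len = pow(w_len, p - 2, p)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w % p
                a[start + k] = (u + v) % p
                a[start + k + half] = (u - v) % p
                w = w * w_len % p
        length <<= 1
    if invert:
        n_inv = pow(n, p - 2, p)
        a = [v * n_inv % p for v in a]
    return a


def cyclic_conv_mod(x, y, p, g):
    """Cyclic convolution of integer vectors x and y, reduced modulo p."""
    fx = ntt([v % p for v in x], p, g)
    fy = ntt([v % p for v in y], p, g)
    return ntt([s * t % p for s, t in zip(fx, fy)], p, g, invert=True)


def exact_cyclic_convolution(x, y):
    """Exact result, valid while every output satisfies |c| < P1 * P2 / 2."""
    r1 = cyclic_conv_mod(x, y, P1, G)
    r2 = cyclic_conv_mod(x, y, P2, G)
    inv_p1 = pow(P1, P2 - 2, P2)  # inverse of P1 modulo P2
    m = P1 * P2
    out = []
    for s, t in zip(r1, r2):
        c = s + P1 * ((t - s) * inv_p1 % P2)  # CRT lift into [0, m)
        out.append(c - m if c > m // 2 else c)  # map to the signed range
    return out


def direct_cyclic_convolution(x, y):
    """O(n^2) reference using Python's arbitrary-precision integers."""
    n = len(x)
    return [sum(x[j] * y[(k - j) % n] for j in range(n)) for k in range(n)]


x = [1000003, -999983, 524287, -131071, 65537, -8191, 127, -31]
y = [999979, 65521, -262139, 16381, -1048573, 4093, -251, 61]
assert exact_cyclic_convolution(x, y) == direct_cyclic_convolution(x, y)
```

Because all arithmetic is modular integer arithmetic, the result carries no rounding error as long as the true outputs fit in the range covered by the product of the moduli. The products here exceed a single modulus, which is why two primes and the CRT step are needed.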

Paper Structure

This paper contains 23 sections, 57 equations, 13 figures, 1 table, 8 algorithms.

Figures (13)

  • Figure 1: Comparison of split width $\alpha$ (bits per split component) for various convolution methods in single precision. A larger $\alpha$ implies fewer splits and thus lower computational cost.
  • Figure 2: Maximum relative error of proposed double-precision Bluestein FFT ($(K,\,L)=(\infty,\,1)$) compared with that of other double-precision FFT implementations.
  • Figure 3: Relative error of proposed double-precision Bluestein FFT ($(K,\,L)=(\infty,\,1)$) compared with that of other double-precision FFT implementations.
  • Figure 4: Number of splits of $\bm{x}'$ and $\bm{\omega}^{*}$ in proposed double-precision Bluestein FFT ($(K,\,L)=(\infty,\,1)$).
  • Figure 5: Total number of 32-bit NTTs and inverse NTTs in proposed double-precision Bluestein FFT ($(K,\,L)=(\infty,\,1)$).
  • ...and 8 more figures