TurboFNO: High-Performance Fourier Neural Operator with Fused FFT-GEMM-iFFT on GPU

Shixun Wu, Yujia Zhai, Huangliang Dai, Hairui Zhao, Yue Zhu, Haiyang Hu, Zizhong Chen

TL;DR

TurboFNO presents the first fully fused GPU kernel that combines FFT, CGEMM, and iFFT for Fourier Neural Operators. By introducing built-in FFT pruning, truncation, zero-padding, and a dataflow-aligned fusion strategy, the approach reduces memory traffic and kernel-launch overhead, outperforming cuFFT/cuBLAS and PyTorch baselines on NVIDIA A100 GPUs. The method includes custom CGEMM and FFT kernels with warp-level swizzling to maximize shared memory utilization and end-to-end fusion. Experimental results demonstrate substantial speedups, with average gains around 67% and peaks up to 150% across 1D and 2D FNO workloads, validating the practical impact for large-scale scientific simulations. These findings highlight memory-transaction reduction through architecture-aware co-design as a key driver of performance in spectral neural operators.

Abstract

Fourier Neural Operators (FNO) are widely used for learning partial differential equation solution operators. However, FNO lacks architecture-aware optimizations, with its Fourier layers executing FFT, filtering, GEMM, zero padding, and iFFT as separate stages, incurring multiple kernel launches and significant global memory traffic. We propose TurboFNO, the first fully fused FFT-GEMM-iFFT GPU kernel with built-in FFT optimizations. We first develop FFT and GEMM kernels from scratch, achieving performance comparable to or faster than the closed-source SOTA cuBLAS and cuFFT. Additionally, our FFT kernel integrates built-in high-frequency truncation, input zero-padding, and pruning features to avoid additional memory copy kernels. To fuse the FFT and GEMM workloads, we propose an FFT variant in which a single thread block iterates over the hidden dimension, aligning with the $k$-loop in GEMM. Additionally, we design two shared memory swizzling patterns to achieve 100% memory bank utilization when forwarding FFT output to GEMM and enabling the iFFT to retrieve GEMM results directly from shared memory. Experimental results on an NVIDIA A100 GPU show that TurboFNO outperforms PyTorch, cuBLAS, and cuFFT by up to 150%.
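
To make the five separate stages named in the abstract concrete, here is a minimal CPU reference sketch of an unfused 1D Fourier layer in plain Python. It is not the authors' kernel. The function names (`dft`, `fourier_layer_unfused`), the naive O(n^2) DFT standing in for an FFT, the choice to keep the leading `modes` bins, and the single complex weight matrix shared across all retained modes are assumptions made for illustration only. Each stage below corresponds to what would be a separate kernel launch with a round trip through global memory in the unfused baseline.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) DFT of a list of numbers (stand-in for an FFT kernel)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

def fourier_layer_unfused(x, weight, modes):
    """Unfused Fourier layer sketch.

    x:      [hidden_in][n] input signal, one row per channel
    weight: [hidden_in][hidden_out] complex matrix
            (ASSUMPTION: shared across modes, so the mixing is one GEMM)
    modes:  number of leading frequency bins kept after truncation
    """
    n = len(x[0])
    h_in, h_out = len(weight), len(weight[0])
    # Stage 1: FFT along the spatial axis, one channel at a time.
    spec = [dft(ch) for ch in x]
    # Stage 2: high-frequency truncation (keep the first `modes` bins).
    spec = [ch[:modes] for ch in spec]
    # Stage 3: complex GEMM over the hidden dimension; the reduction
    # index c plays the role of the GEMM k-loop.
    mixed = [[sum(spec[c][m] * weight[c][o] for c in range(h_in))
              for m in range(modes)]
             for o in range(h_out)]
    # Stage 4: zero-pad each output channel back to n bins.
    padded = [ch + [0j] * (n - modes) for ch in mixed]
    # Stage 5: inverse FFT back to the spatial domain.
    return [dft(ch, inverse=True) for ch in padded]
```

With an identity weight and `modes` equal to the signal length, the layer reproduces its input up to rounding. With `modes=1`, only the DC bin survives, so each output channel is constant. In this sketch the reduction in stage 3 runs over the hidden dimension, which is the loop the abstract describes aligning with a thread block's iteration in the FFT variant.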

Paper Structure

This paper contains 27 sections, 19 figures, 2 tables.

Figures (19)

  • Figure 1: Fourier Neural Operator and TurboFNO
  • Figure 2: Fourier Neural Operator Dataflow
  • Figure 3: CGEMM and FFT
  • Figure 4: FFT Global Memory
  • Figure 5: FFT Prune
  • ...and 14 more figures