Memory-Efficient Training with In-Place FFT Implementation
Xinyu Ding, Bangtian Liu, Siyu Liao, Zhongfeng Wang
TL;DR
The paper tackles the memory bottleneck in training large neural models by introducing rdFFT, a real-domain, fully in-place FFT that eliminates intermediate allocations. It leverages conjugate symmetry and a novel memory layout to pack the conventional $N+2$ real values stored by rFFT ($N/2+1$ complex outputs) into $N$ real slots, enabling true in-place forward and backward passes. The approach is integrated with circulant-matrix training and demonstrates substantial memory savings during both single-layer and full-model fine-tuning on models such as LLaMA2-7B and RoBERTa-large, while preserving numerical accuracy and competitive runtime. This in-place, lossless frequency-domain operator promises practical improvements for memory-constrained training and opens avenues for hardware-aware optimizations and broader structured transformations.
Abstract
Fast Fourier Transforms (FFTs) are widely used to reduce memory and computational costs in deep learning. However, existing implementations, including standard FFT and real FFT (rFFT), cannot achieve true in-place computation. In particular, rFFT maps a real input of size n to a complex output of size n/2+1, causing a dimensional mismatch and requiring additional memory allocation. We propose the first real-domain, fully in-place FFT framework (rdFFT) that preserves input-output memory space consistency. By leveraging butterfly operation symmetry and conjugate properties in the frequency domain, we design an implicit complex encoding scheme that eliminates intermediate cache usage entirely. Experiments on multiple natural language understanding tasks demonstrate the method's effectiveness in reducing training memory cost, offering a promising direction for frequency-domain lightweight adaptation.
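The size mismatch described above can be illustrated with a minimal NumPy sketch. It is not the authors' rdFFT kernel or memory layout, and it is not in-place, since `np.fft.rfft` allocates its own output. It only shows why n real slots are enough: for an even-length real input, the DC and Nyquist bins have zero imaginary part, so the n/2+1 complex outputs carry n independent real values. The function names `pack_rfft` and `unpack_rfft` and the interleaved ordering are hypothetical choices made for illustration.

```python
import numpy as np


def pack_rfft(x):
    """Store the rFFT of an even-length real 1-D signal in n real values.

    Hypothetical layout (an assumption, not the paper's):
    [Re X[0], Re X[n/2], Re X[1], Im X[1], Re X[2], Im X[2], ...]
    """
    n = x.shape[0]
    assert n % 2 == 0, "this sketch assumes an even length"
    X = np.fft.rfft(x)                 # n//2 + 1 complex values = n + 2 reals
    packed = np.empty(n, dtype=np.float64)
    packed[0] = X[0].real              # DC bin: imaginary part is zero
    packed[1] = X[n // 2].real         # Nyquist bin: imaginary part is zero
    packed[2::2] = X[1:n // 2].real    # remaining bins, real parts
    packed[3::2] = X[1:n // 2].imag    # remaining bins, imaginary parts
    return packed


def unpack_rfft(packed):
    """Rebuild the n//2 + 1 complex rFFT bins from the packed layout."""
    n = packed.shape[0]
    X = np.empty(n // 2 + 1, dtype=np.complex128)
    X[0] = packed[0]
    X[n // 2] = packed[1]
    X[1:n // 2] = packed[2::2] + 1j * packed[3::2]
    return X


rng = np.random.default_rng(0)
x = rng.standard_normal(8)
packed = pack_rfft(x)
# The packed spectrum occupies exactly as many reals as the input signal.
print(np.fft.rfft(x).shape, packed.shape)
# Round trip back to the time domain recovers the input.
print(np.allclose(np.fft.irfft(unpack_rfft(packed), n=x.shape[0]), x))
```

Because the packed spectrum has the same shape and dtype as the input, a kernel that computed it directly could in principle overwrite the input buffer; how rdFFT arranges its butterflies to do so is the paper's contribution and is not reproduced here.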
