Memory-Efficient Training with In-Place FFT Implementation

Xinyu Ding, Bangtian Liu, Siyu Liao, Zhongfeng Wang

TL;DR

The paper tackles the memory bottleneck in training large neural models by introducing rdFFT, a real-domain, fully in-place FFT that eliminates intermediate allocations. It exploits conjugate symmetry and a novel memory layout to fit the conventional $N+2$ real values stored by rFFT into $N$ real slots, enabling true in-place forward and backward passes. The approach is integrated with circulant-matrix training and demonstrates substantial memory savings in both single-layer and full-model fine-tuning on models such as LLaMA2-7B and RoBERTa-large, while preserving numerical accuracy and competitive runtime. This in-place, lossless frequency-domain operator promises practical improvements for memory-constrained training and opens avenues for hardware-aware optimizations and broader structured transformations.
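
For readers unfamiliar with why circulant-matrix training involves FFTs at all, the following is a minimal NumPy sketch of the underlying identity: multiplying by a circulant matrix equals a circular convolution, which can be computed with real FFTs. The function names `circulant_matvec` and `circulant_dense` are hypothetical and chosen for illustration. This is not the paper's implementation, and NumPy allocates temporaries here, so nothing in this sketch is in-place.

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix with first column c by x using real FFTs."""
    n = c.shape[0]
    # Circular convolution theorem: C @ x == irfft(rfft(c) * rfft(x)).
    return np.fft.irfft(np.fft.rfft(c) * np.fft.rfft(x), n=n)

def circulant_dense(c):
    """Explicit N x N circulant matrix with first column c (for checking only)."""
    n = c.shape[0]
    # Entry (i, j) of a circulant matrix is c[(i - j) mod n].
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return c[idx]

rng = np.random.default_rng(0)
c = rng.standard_normal(8)   # the N parameters that define the whole matrix
x = rng.standard_normal(8)   # input vector

y_fft = circulant_matvec(c, x)      # O(N log N) time, N parameters
y_dense = circulant_dense(c) @ x    # O(N^2) time and storage, same result
```

The dense matrix is built only to check the result. In the FFT path the layer is described by the `N` entries of `c`, and the memory cost is dominated by the frequency-domain buffers that `rfft` and `irfft` allocate.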

Abstract

Fast Fourier Transforms (FFTs) are widely used to reduce memory and computational costs in deep learning. However, existing implementations, including standard FFT and real FFT (rFFT), cannot achieve true in-place computation. In particular, rFFT maps an input of size n to a complex output of size n/2+1, causing a dimensional mismatch and requiring additional memory allocation. We propose the first real-domain, fully in-place FFT framework (rdFFT) that preserves input-output memory space consistency. By leveraging butterfly operation symmetry and conjugate properties in the frequency domain, we design an implicit complex encoding scheme that eliminates intermediate cache usage entirely. Experiments on multiple natural language understanding tasks demonstrate the method's effectiveness in reducing training memory cost, offering a promising direction for frequency-domain lightweight adaptation.
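
The size mismatch described above can be made concrete with a small NumPy sketch. For even n, `rfft` returns n/2+1 complex values, that is n+2 floats, but the imaginary parts of the first (DC) and last (Nyquist) bins are always zero, so n floats carry all the information. The interleaved layout below and the names `pack_rfft` and `unpack_irfft` are assumptions made for illustration. This is not the paper's encoding scheme, and the sketch is not in-place, since NumPy still allocates the intermediate spectrum.

```python
import numpy as np

def pack_rfft(x):
    """Return the rFFT of a real, even-length x stored in exactly len(x) floats."""
    n = x.shape[0]
    assert n % 2 == 0, "this sketch assumes an even length"
    spec = np.fft.rfft(x)            # n//2 + 1 complex values = n + 2 floats
    packed = np.empty(n, dtype=np.float64)
    packed[0] = spec[0].real         # DC bin: imaginary part is zero
    packed[1] = spec[-1].real        # Nyquist bin: imaginary part is zero
    packed[2::2] = spec[1:-1].real   # bins 1 .. n/2-1, real parts
    packed[3::2] = spec[1:-1].imag   # bins 1 .. n/2-1, imaginary parts
    return packed

def unpack_irfft(packed):
    """Invert pack_rfft: rebuild the half spectrum and return the real signal."""
    n = packed.shape[0]
    spec = np.empty(n // 2 + 1, dtype=np.complex128)
    spec[0] = packed[0]
    spec[-1] = packed[1]
    spec[1:-1] = packed[2::2] + 1j * packed[3::2]
    return np.fft.irfft(spec, n=n)

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
floats_in_rfft = np.fft.rfft(x).view(np.float64).size  # n + 2 floats
packed = pack_rfft(x)                                   # n floats
restored = unpack_irfft(packed)
```

The round trip is lossless up to floating-point error, which is the counting argument behind storing the spectrum in the same number of real slots as the input.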

Paper Structure

This paper contains 22 sections, 2 theorems, 11 equations, 2 figures, 4 tables.

Key Result

Theorem 1

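Let $x \in \mathbb{R}^N$ be a real-valued sequence, and let $y_k$, $k = 0, \dots, N-1$, denote its discrete Fourier transform as computed by the Fast Fourier Transform (FFT). Then the FFT output satisfies the conjugate symmetry property: $y_{N-k} = \overline{y_k}$ for $k = 1, \dots, N-1$.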
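
A short derivation of this property, assuming the usual DFT sign convention (the paper's own notation may differ), uses only that each $x_n$ is real and that $e^{-2\pi i n} = 1$ for integer $n$:

```latex
\begin{aligned}
y_k &= \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \qquad k = 0, \dots, N-1, \\
y_{N-k} &= \sum_{n=0}^{N-1} x_n \, e^{-2\pi i (N-k) n / N}
         = \sum_{n=0}^{N-1} x_n \, e^{2\pi i k n / N}
         = \overline{y_k}, \qquad k = 1, \dots, N-1.
\end{aligned}
```

Two consequences follow directly: $y_0$ is real, and for even $N$ the bin $y_{N/2}$ equals its own conjugate and is therefore real as well. The $N/2+1$ complex outputs of rFFT thus contain only $N$ independent real numbers.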

Figures (2)

  • Figure 1: Overview of our method and its differences from standard FFT and rFFT implementations. The green section depicts the Butterfly Operation Diagram using a 16-point FFT (16-FFT) as an example. The orange section illustrates the storage formats of different FFT implementations, shown on an 8-point FFT (8-FFT). Two representative butterfly computation paths in the 16-FFT are highlighted in red, and expanded into: (i) the blue section showing Complex-to-Complex FFT and IFFT operations, and (ii) the red section showing Float-to-Float FFT and IFFT operations—both derived from the red paths in the 16-FFT diagram. This figure summarizes the key computational flows and memory layouts addressed by our in-place real-domain FFT design.
  • Figure 2: Memory breakdown during single-layer fine-tuning with hidden dimension $D=4096$, under two batch sizes: (a) $B = 1$ and (b) $B = 256$. Intermediate tensors are allocated during the forward pass, while gradients appear in the backward pass. This illustrates how batch size impacts memory allocation for activations and gradients.

Theorems & Definitions (3)

  • Theorem 1: Conjugate Symmetry of Real FFT
  • Proposition 1
  • proof