Parameter-Efficient Fine-Tuning with Discrete Fourier Transform

Ziqi Gao; Qichao Wang; Aochuan Chen; Zijing Liu; Bingzhe Wu; Liang Chen; Jia Li

TL;DR

This work introduces FourierFT, a parameter-efficient fine-tuning method that represents weight updates in the Fourier domain. For each adapted layer it learns a small set of spectral coefficients, placed at entries specified by a spectral entry matrix shared across layers, and recovers the spatial-domain weight change with an inverse discrete Fourier transform. FourierFT trains far fewer parameters than LoRA, in some settings by orders of magnitude, while maintaining or improving performance across NLP and CV tasks. The approach shows strong results on GLUE, E2E, instruction tuning for LLaMA-family models, and ViT image classification. For example, instruction tuning on LLaMA2-7B uses only $0.064\mathrm{M}$ trainable parameters, compared to LoRA's $33.5\mathrm{M}$. Overall, FourierFT offers a scalable, memory-efficient alternative for adapting large foundation models, enabling broader on-device customization and multi-task specialization with minimal storage overhead.

Abstract

Low-rank adaptation (LoRA) has recently gained much interest in fine-tuning foundation models. It effectively reduces the number of trainable parameters by incorporating low-rank matrices $A$ and $B$ to represent the weight change, i.e., $\Delta W=BA$. Despite LoRA's progress, it faces storage challenges when handling a large number of customized adaptations or larger base models. In this work, we aim to further compress trainable parameters by exploiting the powerful expressiveness of the Fourier transform. Specifically, we introduce FourierFT, which treats $\Delta W$ as a matrix in the spatial domain and learns only a small fraction of its spectral coefficients. With the trained spectral coefficients, we apply the inverse discrete Fourier transform to recover $\Delta W$. Empirically, our FourierFT method shows comparable or better performance with fewer parameters than LoRA on various tasks, including natural language understanding, natural language generation, instruction tuning, and image classification. For example, when performing instruction tuning on the LLaMA2-7B model, FourierFT surpasses LoRA with only 0.064M trainable parameters, compared to LoRA's 33.5M. Our code is released at https://github.com/Chaos96/fourierft.
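The recovery step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function name `fourierft_delta`, the scaling factor `alpha`, the uniform sampling of entries, and the choice to keep only the real part of the inverse transform are all assumptions made for illustration.

```python
import numpy as np

def fourierft_delta(coeffs, entries, d1, d2, alpha=1.0):
    """Recover a dense d1 x d2 weight change from n spectral coefficients.

    coeffs  : (n,) trainable spectral coefficients
    entries : (2, n) integer row/column indices of the selected spectral entries
    alpha   : hypothetical scaling factor (an assumption of this sketch)
    """
    # Sparse spectral matrix: coefficients at the selected entries, zeros elsewhere.
    spectral = np.zeros((d1, d2), dtype=np.float64)
    spectral[entries[0], entries[1]] = coeffs
    # The 2D inverse DFT maps the spectrum back to the spatial domain.
    # Keeping only the real part is an assumption of this sketch.
    return alpha * np.fft.ifft2(spectral).real

rng = np.random.default_rng(0)
d1, d2, n = 64, 48, 100

# Sample n distinct spectral entries uniformly at random (illustrative choice).
flat = rng.choice(d1 * d2, size=n, replace=False)
entries = np.stack(np.unravel_index(flat, (d1, d2)))  # shape (2, n)

coeffs = rng.standard_normal(n)  # stands in for the trained coefficients
delta_w = fourierft_delta(coeffs, entries, d1, d2)
print(delta_w.shape, coeffs.size, delta_w.size)
```

Here a dense update with `d1 * d2` entries is parameterized by only `n` numbers, which is where the storage saving comes from.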

Paper Structure (47 sections, 5 equations, 10 figures, 13 tables, 1 algorithm)

This paper contains 47 sections, 5 equations, 10 figures, 13 tables, 1 algorithm.

Introduction
Related Works
Parameter-Efficient Fine-Tuning.
Sparse Fourier Transform in Deep Learning.
Method
Forward Pass
Initialization for the Entry Matrix $E$.
Parameter Summary
Experiments
Baselines.
Natural Language Understanding
Models and Datasets.
Implementation Details.
Results.
Natural Language Generation
...and 32 more sections

Figures (10)

Figure 1: Summary of the performance (y-axis) of fine-tuning methods with different numbers of trainable parameters (x-axis) on NLP (left) and CV (right) tasks. The left side shows the instruction tuning task, where the LLaMA2-7B model is fine-tuned with Alpaca and evaluated by GPT-4. The right side shows the image classification task, where the Vision Transformer (ViT) is fine-tuned and tested on the DTD dataset. Black circles ($\color{black}{\bullet}$) represent the Full Fine-tuning (FF) method. Orange circles ($\color{orange}{\bullet}$) represent the LoRA method with $r=\{32,64,128\}$ (left) and $r=\{8,16,32\}$ (right). Blue circles ($\color{blue}{\bullet}$) represent our proposed method with $n=\{1000, 2000\}$ (left) and $n=\{3000, 10000\}$ (right).
Figure 2: Overview of LoRA (left) and our FourierFT (right) method. In LoRA, only the low-rank (rank $r$) matrices $A$ and $B$ are trained. The weight change is represented by their product, i.e., $\Delta W=BA$. For each pre-trained weight $W\in\mathbb{R}^{d_{1}\times d_{2}}$, the theoretical number of trainable parameters in LoRA is $r\times(d_{1}+d_{2})$. In FourierFT, we first randomly generate the spectral entry matrix $E\in\mathbb{R}^{2\times n}$, which is shared across all layers to reduce parameter storage requirements. The complete spectral matrix is formed by placing a trainable coefficient vector in $\mathbb{R}^n$ at the selected entries and $0$s at the remaining entries. We obtain the weight change $\Delta W$ by directly performing the inverse discrete Fourier transform (IDFT) on the updated spectral matrix. For all $L$ adapted layers, FourierFT needs to store $n\times(2+L)$ parameters.
Figure 3: Visualization of entry sampling probability at different favored central frequencies $f_c$.
Figure 4: Performance on the GLUE benchmark with RoBERTa Base vs. number of trainable parameters (per layer) of LoRA and ours. For all 6 datasets, we apply the setting of $r=\{1,2,4,6,8,15\}$ for LoRA and $n=\{50,100,200,1000,6144,12288\}$ for FourierFT.
Figure 5: Results on 4 datasets in GLUE with different $f_c$ values.
...and 5 more figures
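
The parameter counts in the Figure 2 caption, $r\times(d_{1}+d_{2})$ per layer for LoRA and $n\times(2+L)$ in total for FourierFT, can be checked with a short calculation. The sketch below assumes a LLaMA2-7B-like setup that this page does not state: hidden size 4096, 32 transformer blocks, and adaptation of the query and value projections only, giving $L=64$ square weight matrices. The choices $r=64$ and $n=1000$ are taken from the ranges listed in Figure 1, and the helper names are hypothetical.

```python
def lora_params(r, d1, d2, num_layers):
    """Trainable parameters for LoRA: r * (d1 + d2) per adapted layer."""
    return r * (d1 + d2) * num_layers

def fourierft_params(n, num_layers, include_entries=True):
    """Stored parameters for FourierFT: n coefficients per adapted layer,
    plus the shared 2 x n entry matrix when include_entries is True."""
    shared = 2 if include_entries else 0
    return n * (shared + num_layers)

# Assumed setup (not stated on this page): hidden size 4096, 32 blocks,
# query and value projections adapted, so L = 2 * 32 = 64 layers.
d = 4096
L = 2 * 32

print(lora_params(64, d, d, L))                          # 33554432
print(fourierft_params(1000, L))                         # 66000
print(fourierft_params(1000, L, include_entries=False))  # 64000
```

Under these assumptions LoRA comes to about 33.5M parameters and FourierFT to 0.064M trainable coefficients, consistent with the figures quoted in the abstract. The shared entry matrix adds only $2n$ stored values on top of the trainable coefficients.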
