FourierMoE: Fourier Mixture-of-Experts Adaptation of Large Language Models

Juyong Jiang, Fan Wang, Hong Qi, Sunghun Kim, Jing Tang

Abstract

Parameter-efficient fine-tuning (PEFT) has emerged as a crucial paradigm for adapting large language models (LLMs) under constrained computational budgets. However, standard PEFT methods often struggle in multi-task fine-tuning settings, where diverse optimization objectives induce task interference and limited parameter budgets lead to representational deficiency. While recent approaches incorporate mixture-of-experts (MoE) to alleviate these issues, they predominantly operate in the spatial domain, which may introduce structural redundancy and parameter overhead. To overcome these limitations, we reformulate adaptation in the spectral domain. Our spectral analysis reveals that different tasks exhibit distinct frequency energy distributions, and that LLM layers display heterogeneous frequency sensitivities. Motivated by these insights, we propose FourierMoE, which integrates the MoE architecture with the inverse discrete Fourier transform (IDFT) for frequency-aware adaptation. Specifically, FourierMoE employs a frequency-adaptive router to dispatch tokens to experts specialized in distinct frequency bands. Each expert learns a set of conjugate-symmetric complex coefficients, preserving complete phase and amplitude information while theoretically guaranteeing lossless IDFT reconstruction into real-valued spatial weights. Extensive evaluations across 28 benchmarks, multiple model architectures, and scales demonstrate that FourierMoE consistently outperforms competitive baselines in both single-task and multi-task settings while using significantly fewer trainable parameters. These results highlight the promise of spectral-domain expert adaptation as an effective and parameter-efficient paradigm for LLM fine-tuning.
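
The abstract's central mechanism, experts that learn conjugate-symmetric complex coefficients which the IDFT maps to real-valued spatial weight updates, can be made concrete with a short sketch. This is a minimal NumPy illustration of the general technique, not the paper's implementation; the helper names `symmetrize` and `spectral_to_spatial` are ours.

```python
import numpy as np

def symmetrize(F: np.ndarray) -> np.ndarray:
    """Project spectral coefficients onto the conjugate-symmetric
    subspace: F[u, v] = conj(F[(-u) % N, (-v) % M])."""
    G = np.roll(np.flip(F, axis=(0, 1)), shift=(1, 1), axis=(0, 1))
    return 0.5 * (F + np.conj(G))

def spectral_to_spatial(F: np.ndarray) -> np.ndarray:
    """Recover a real-valued spatial weight update via the 2-D IDFT."""
    W = np.fft.ifft2(symmetrize(F))
    # Conjugate symmetry guarantees a real result up to float error.
    assert np.allclose(W.imag, 0.0)
    return W.real

rng = np.random.default_rng(0)
F = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
dW = spectral_to_spatial(F)  # real-valued Delta W of shape (8, 8)
```

The `assert` mirrors the guarantee stated in the abstract: once $F[u,v] = \overline{F[(-u) \bmod N, (-v) \bmod M]}$ holds, the imaginary part of the inverse transform vanishes up to floating-point error.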

Paper Structure

This paper contains 37 sections, 7 theorems, 20 equations, 6 figures, 14 tables, and 1 algorithm.

Key Result

Proposition 3.1

Restricting spectral coefficients to $\mathbb{R}$ (i.e., $b_{u,v}=0$) forces the phase $\Phi_{u,v} = \operatorname{atan2}(b_{u,v}, a_{u,v})$ to be either $0$ or $\pi$. This constrains the resulting spatial signal to be an even function (symmetric around the origin), rendering the model incapable of representing asymmetric spatial signals.
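
A quick numerical illustration of this claim (ours, not taken from the paper): with real-only coefficients the reconstructed signal equals its own reflection around the origin, while a nonzero imaginary part $b_u$ immediately makes asymmetric signals reachable.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 16

def reflect(x: np.ndarray) -> np.ndarray:
    """x[(-n) % N]: the reflection of x around the origin."""
    return np.roll(x[::-1], 1)

# Real-only coefficients (b_u = 0): phase restricted to {0, pi}.
a = rng.standard_normal(N)
x = np.fft.ifft(a).real            # cosine-only reconstruction
print(np.allclose(x, reflect(x)))  # True: the signal is even

# Generic complex coefficients (b_u != 0) break the symmetry.
b = rng.standard_normal(N)
y = np.fft.ifft(a + 1j * b).real
print(np.allclose(y, reflect(y)))  # False: asymmetric signals appear
```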

Figures (6)

  • Figure 1: Spectral analysis across layers and tasks. (Left) The power spectral density of the RoBERTa-large weights shows layer-wise differences, with early layers exhibiting clear high-frequency spikes and deeper layers showing a progressively smoother spectrum. (Right) For different GLUE tasks (CoLA, SST-2, QQP, and MRPC), the spectra of hidden representations from the eighth layer of RoBERTa-large display distinct frequency energy distributions, revealing task-specific preferences. These observations suggest that effective adaptation requires frequency-specific modulation tailored to different layers and downstream tasks.
  • Figure 2: The overall framework of FourierMoE, which reparameterizes LLM weight updates $\Delta \mathbf{W}$ in the spectral domain. A frequency-adaptive router $\mathcal{G}_{\Phi}(x)$ dynamically assigns tokens to experts specialized in distinct frequency bands, which mitigates task interference via parameter isolation. Each expert learns conjugate-symmetric complex coefficients, enabling complete spectral representation while theoretically guaranteeing real-valued weight updates after IDFT. (A minimal routing sketch follows this figure list.)
  • Figure 3: Component ablation study of FourierMoE on Cars, DTD, and SUN397. Accuracy is reported to assess the contribution of each component. Removing any component results in a performance drop compared to the full FourierMoE implementation.
  • Figure 4: Expert scaling analysis on the MRPC, QNLI, and RTE datasets. We report the accuracy scores for each dataset. Left: Impact of the trainable coefficient count $n$ per expert. Center: Scalability with respect to the total number of experts $Z$ (fixed activation $k=2$). Right: Effect of the active expert count $k$ given a fixed total pool size of $Z=8$.
  • Figure 5: Visualization of expert-wise coefficient distributions across different layers of the fine-tuned model. Each expert's coefficients show consistent concentration patterns across layers, indicating stable specialization.
  • ...and 1 more figure
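
The frequency-adaptive router of Figure 2 is described only at caption level, so the following PyTorch-style sketch shows the generic top-$k$ dispatch it implies. The linear gate and the defaults $Z=8$, $k=2$ (matching the pool sizes swept in Figure 4) are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

class FrequencyAdaptiveRouter(torch.nn.Module):
    """Toy gate G_Phi(x): score each token against Z frequency-band
    experts and activate the top k (illustrative only)."""

    def __init__(self, d_model: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x: torch.Tensor):
        logits = self.gate(x)                       # (tokens, Z) scores
        weights, idx = logits.topk(self.k, dim=-1)  # pick k experts/token
        weights = F.softmax(weights, dim=-1)        # renormalize gate weights
        return weights, idx                         # both (tokens, k)

router = FrequencyAdaptiveRouter(d_model=768)
w, idx = router(torch.randn(4, 768))  # 4 tokens, each routed to 2 experts
```

In the full architecture, each selected expert would contribute its IDFT-reconstructed weight update scaled by the corresponding gate weight.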

Theorems & Definitions (10)

  • Proposition 3.1: Phase-Amplitude Completeness
  • Theorem 3.2: Conjugate Symmetry Condition
  • Proof of Theorem 3.2
  • Corollary 3.3: Truncation Error Bound
  • Lemma 8.1: Rank-1 Property of Fourier Kernels (a numerical sanity check follows this list)
  • Proof of Lemma 8.1
  • Theorem 8.2: Spectral Sparsity-Rank Inequality
  • Proof of Theorem 8.2
  • Corollary 8.3: Router as a Rank Selector
  • Proposition 8.4: Global Information Flow
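
Lemma 8.1 and Theorem 8.2 admit a compact numerical sanity check (our sketch under standard DFT conventions): each 2-D Fourier kernel factors as an outer product of 1-D exponentials and is therefore rank 1, so an update synthesized from $s$ nonzero spectral coefficients has rank at most $s$.

```python
import numpy as np

N, M, u, v = 8, 8, 3, 5
n, m = np.arange(N), np.arange(M)

# 2-D Fourier kernel K[n, m] = exp(2j*pi*(u*n/N + v*m/M)).
K = np.exp(2j * np.pi * (u * n[:, None] / N + v * m[None, :] / M))
# Lemma 8.1: K is an outer product of 1-D exponentials, hence rank 1.
print(np.allclose(K, np.outer(np.exp(2j * np.pi * u * n / N),
                              np.exp(2j * np.pi * v * m / M))))  # True
print(np.linalg.matrix_rank(K))  # 1

# Theorem 8.2 flavor: s nonzero coefficients => rank(Delta W) <= s.
F = np.zeros((N, M), dtype=complex)
F[1, 2], F[3, 5] = 1.0 + 0.5j, -0.7j           # s = 2 active coefficients
print(np.linalg.matrix_rank(np.fft.ifft2(F)))  # 2
```

On this reading, Corollary 8.3's router acts as a rank selector: choosing which frequency experts are active bounds the rank of the aggregate update.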