FedMuon: Accelerating Federated Learning with Matrix Orthogonalization

Junkang Liu, Fanhua Shang, Junchao Zhou, Hongying Liu, Yuanyuan Liu, Jin Liu

TL;DR

This work tackles the communication-round bottleneck in federated learning by introducing FedMuon, a structure-aware optimizer that treats model updates as matrices. FedMuon extends the Muon optimizer with two mechanisms (local-global alignment and momentum aggregation) to mitigate non-IID client drift, and uses SVD-based compression to communicate momentum efficiently. The authors prove a linear speedup convergence rate that does not rely on a heterogeneity assumption, and report faster convergence and fewer communication rounds on vision and language tasks, including Transformer-based models. Practically, FedMuon offers a principled path to scalable, communication-efficient federated training, including fine-tuning of large foundation models.

Abstract

The core bottleneck of Federated Learning (FL) lies in the number of communication rounds, so making local updates more effective is crucial for reducing them. Existing FL methods still primarily use element-wise local optimizers (Adam/SGD), neglecting the geometric structure of the weight matrices. This often amplifies pathological directions in the weights during local updates, which worsens the condition number and slows convergence. We therefore use the Muon optimizer for local updates; it applies matrix orthogonalization to optimize matrix-structured parameters. Experimental results show that, in the IID setting, Local Muon significantly accelerates the convergence of FL and reduces communication rounds compared to Local SGD and Local AdamW. However, in the non-IID setting, independent matrix orthogonalization based on each client's local distribution induces strong client drift. Applying Muon in non-IID FL poses two significant challenges: (1) client-specific preconditioners that lead to client drift; (2) momentum reinitialization. To address these challenges, we propose a novel Federated Muon optimizer (FedMuon), which incorporates two key techniques: (1) momentum aggregation, where clients use the aggregated momentum for local initialization; (2) local-global alignment, where the local gradients are aligned with the global update direction to significantly reduce client drift. Theoretically, we prove that FedMuon achieves a linear speedup convergence rate without the heterogeneity assumption; in the rate, $S$ is the number of participating clients per round, $K$ is the number of local iterations, and $R$ is the total number of communication rounds. Empirically, we validate the effectiveness of FedMuon on language and vision models. Compared to several baselines, FedMuon significantly reduces communication rounds and improves test accuracy.
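
The ingredients named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and several parts are assumptions: orthogonalization is done with an exact SVD ($UV^{\top}$), whereas Muon implementations typically use an iterative approximation; the alignment rule (a convex mix of the local gradient and the global direction with weight `alpha`) is a guess at one plausible form; the random gradients stand in for real backpropagation; and the names `orthogonalize`, `local_steps`, `alpha`, `beta`, and `lr` are hypothetical.

```python
import numpy as np

def orthogonalize(m):
    """Return U @ Vt from the SVD of m, i.e. m with all singular values set to 1."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def local_steps(w, grads, m_global, g_global, lr=0.1, beta=0.9, alpha=0.5):
    """Run one client's local iterations (hypothetical update rule).

    w        : weight matrix received from the server
    grads    : list of stochastic gradient matrices, one per local step
    m_global : aggregated momentum used to initialise the local momentum
    g_global : global update direction used for alignment
    """
    m = m_global.copy()  # momentum aggregation: start from the shared state
    for g in grads:
        # Assumed alignment rule: pull the local gradient toward the global one.
        g_aligned = (1 - alpha) * g + alpha * g_global
        m = beta * m + (1 - beta) * g_aligned
        w = w - lr * orthogonalize(m)  # matrix-orthogonalized step
    return w, m

rng = np.random.default_rng(0)
d, num_clients, k_steps = 8, 4, 3
w_server = rng.standard_normal((d, d))
m_server = np.zeros((d, d))
g_server = np.zeros((d, d))

# Each participating client runs k_steps local iterations on stand-in gradients.
results = [
    local_steps(
        w_server,
        [rng.standard_normal((d, d)) for _ in range(k_steps)],
        m_server,
        g_server,
    )
    for _ in range(num_clients)
]

# The server averages weights and momenta across clients.
w_new = np.mean([w for w, _ in results], axis=0)
m_new = np.mean([m for _, m in results], axis=0)

update = orthogonalize(m_new)
print("condition number of orthogonalized update:", np.linalg.cond(update))
```

Because orthogonalization sets every singular value of the update to 1, the printed condition number is close to 1 regardless of how ill-conditioned the momentum itself is, which is the geometric property the abstract contrasts with element-wise optimizers.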

Paper Structure

This paper contains 24 sections, 8 theorems, 41 equations, 6 figures, 7 tables, 3 algorithms.

Key Result

Theorem 1

Under Assumptions smoothness and bounded_stochastic_gradient_I, if we take $g^0=0$, $\beta_1=0$, $\lambda=0$, then FedMuon converges as follows. Here $G_0:=\frac{1}{N} \sum_{i=1}^N\left\|\nabla f_i\left(\boldsymbol{x}^0\right)\right\|^2$, $\Delta=f\left(\boldsymbol{x}^0\right)-f^{\star}$, $S$ is the number of participating clients per round, $K$ is the number of local iterations, and $R$ is the total number of communication rounds.

Figures (6)

  • Figure 1: (a–f): Block-wise Hessian structure of Transformer parameters and MLP [zhang2024adam].
  • Figure 2: (a) shows SVD-based matrix orthogonalization; (b) applies SVD to the momentum matrix $M\!\in\!\mathbb{R}^{d\times d}$, i.e., $M \approx U\Sigma V^{\top}$, and keeps the top-$k$ singular vectors to obtain $U\!\in\!\mathbb{R}^{d\times k}$ and $V\!\in\!\mathbb{R}^{k\times d}$.
  • Figure 3: Performance of Local SGD, Local AdamW, and Local Muon, each with a carefully tuned learning rate.
  • Figure 4: (a) Analysis on ViT-Tiny with CIFAR-100, showing optimizer state memory, condition number, computation time, and convergence rounds. Local Muon achieves lower memory cost, a lower condition number, and faster convergence. (b) Training loss curves of ViT-Tiny under non-IID.
  • Figure 5: An illustration of FedMuon, which corrects client drift through local-global alignment.
  • ...and 1 more figure
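
The top-$k$ compression described in the Figure 2 caption can be sketched as a truncated SVD. This is an illustrative example under stated assumptions, not the paper's code: the synthetic momentum matrix (low-rank signal plus small noise), the choice to transmit the singular values alongside the two factors, and the names `compress_momentum` and `decompress_momentum` are all hypothetical. The $k\times d$ factor that the caption calls $V$ corresponds to `vt_k` here.

```python
import numpy as np

def compress_momentum(m, k):
    """Keep the top-k singular triplets of m (truncated SVD)."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]

def decompress_momentum(u_k, s_k, vt_k):
    """Rebuild the rank-k approximation U_k diag(s_k) Vt_k."""
    return (u_k * s_k) @ vt_k

rng = np.random.default_rng(0)
d, k = 64, 4

# Synthetic momentum: a rank-k signal plus small dense noise (an assumption).
m = rng.standard_normal((d, k)) @ rng.standard_normal((k, d))
m += 0.01 * rng.standard_normal((d, d))

u_k, s_k, vt_k = compress_momentum(m, k)
m_hat = decompress_momentum(u_k, s_k, vt_k)

sent = u_k.size + s_k.size + vt_k.size  # 2*d*k + k floats
full = m.size                           # d*d floats
rel_err = np.linalg.norm(m - m_hat) / np.linalg.norm(m)
print(f"floats sent: {sent} of {full}; relative error: {rel_err:.4f}")
```

With $d=64$ and $k=4$, the client sends $2dk+k=516$ floats instead of $d^2=4096$. How much accuracy is lost depends on how quickly the singular values of the real momentum decay, which this synthetic example does not measure.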

Theorems & Definitions (14)

  • Theorem 1: Convergence for non-convex functions
  • Lemma 1
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • ...and 4 more