On Provable Benefits of Muon in Federated Learning

Xinwen Zhang, Hongchang Gao

TL;DR

This work extends the Muon optimizer to federated learning by proposing FedMuon, in which each worker updates its local weight matrices with orthonormalized momentum and communicates periodically. The paper proves nonconvex convergence under both regular and heavy-tailed gradient noise, with a learning rate that is independent of problem-specific parameters and a linear speedup in the number of workers $K$. The convergence rate is $O((KT)^{-1/4})$ under regular noise and $O((KT)^{-(p-1)/(2p)})$ under heavy-tailed noise with tail index $p\in(1,2]$. Crucially, FedMuon achieves these guarantees without gradient clipping: the analysis leverages the bounded update direction of the orthonormalized momentum to avoid $L$-dependent learning-rate constraints. Empirically, FedMuon outperforms classical FL baselines across CNNs, Transformers, and RNNs on CIFAR and text tasks, with stronger gains on Transformer architectures and under data heterogeneity.
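As a quick consistency check on the two stated rates (a worked substitution using only the exponents quoted above, not a result taken from the paper's proofs), setting the tail index to its largest allowed value recovers the regular-noise rate, while the exponent vanishes as $p$ approaches 1:

```latex
\[
\left.\frac{p-1}{2p}\right|_{p=2} = \frac{1}{4}
\quad\Longrightarrow\quad
O\!\left((KT)^{-(p-1)/(2p)}\right)\Big|_{p=2} = O\!\left((KT)^{-1/4}\right),
\qquad
\lim_{p\to 1^{+}} \frac{p-1}{2p} = 0 .
\]
```

For an intermediate value such as $p = 1.5$, the exponent is $0.5/3 = 1/6$, so the stated rate becomes $O((KT)^{-1/6})$: heavier tails give a slower rate, but the dependence on $KT$ is preserved.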

Abstract

The recently introduced optimizer, Muon, has gained increasing attention due to its superior performance across a wide range of applications. However, its effectiveness in federated learning remains unexplored. To address this gap, this paper investigates the performance of Muon in the federated learning setting. Specifically, we propose a new algorithm, FedMuon, and establish its convergence rate for nonconvex problems. Our theoretical analysis reveals multiple favorable properties of FedMuon. In particular, due to its orthonormalized update direction, the learning rate of FedMuon is independent of problem-specific parameters, and, importantly, it can naturally accommodate heavy-tailed noise. The extensive experiments on a variety of neural network architectures validate the effectiveness of the proposed algorithm.
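The idea of local orthonormalized-momentum updates with periodic communication can be illustrated with a minimal NumPy sketch. This is not the paper's algorithm: the function names `orthonormalize` and `fedmuon_sketch`, the momentum form `beta * m + (1 - beta) * g`, the choice to average both parameters and momentum at each communication round, and all default hyperparameters are assumptions made for illustration. The sketch also uses an exact SVD for orthonormalization, whereas Muon implementations typically approximate it with a Newton-Schulz iteration.

```python
import numpy as np


def orthonormalize(m):
    """Map m = U S V^T to U V^T, so every nonzero singular value becomes 1."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def fedmuon_sketch(grad_fns, x0, lr=0.02, beta=0.9, period=4, rounds=50):
    """Hypothetical FedMuon-style loop (illustrative, not the paper's algorithm).

    grad_fns: one stochastic-gradient callable per worker, each mapping a
              parameter matrix to a gradient matrix of the same shape.
    period:   number of local steps between communication rounds.
    """
    k = len(grad_fns)
    xs = [x0.copy() for _ in range(k)]          # local parameter matrices
    ms = [np.zeros_like(x0) for _ in range(k)]  # local momentum buffers

    for _ in range(rounds):
        for _ in range(period):
            for i, grad in enumerate(grad_fns):
                # Assumed momentum form: exponential moving average of gradients.
                ms[i] = beta * ms[i] + (1.0 - beta) * grad(xs[i])
                # The step direction has spectral norm 1, so its size is set
                # by lr alone, regardless of the gradient's magnitude.
                xs[i] = xs[i] - lr * orthonormalize(ms[i])

        # Communication: average across workers (averaging the momentum
        # as well as the parameters is an assumption of this sketch).
        x_avg = sum(xs) / k
        m_avg = sum(ms) / k
        xs = [x_avg.copy() for _ in range(k)]
        ms = [m_avg.copy() for _ in range(k)]

    return xs[0]
```

Because the orthonormalized direction has unit spectral norm no matter how large the raw gradient is, a single noisy gradient cannot produce an arbitrarily large step. This is the intuition behind the abstract's claim that the method accommodates heavy-tailed noise and behind the TL;DR's note that no gradient clipping is needed.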

Paper Structure

This paper contains 24 sections, 20 theorems, 53 equations, 11 figures, 3 tables, 1 algorithm.

Key Result

Theorem 5.1

Under the smoothness, regular-noise, and heterogeneity assumptions, when $0<\beta<1$, FedMuon (Algorithm 1) achieves the following convergence upper bound:

Figures (11)

  • Figure 1: CIFAR-10 on ResNet-18 (period = 4).
  • Figure 2: CIFAR-100 on ResNet-18 (period = 4).
  • Figure 3: CIFAR-10 on ViT (period = 4).
  • Figure 4: CIFAR-10 on ResNet-18 (period = 4, $\mathrm{Dir}(0.5)$).
  • Figure 5: CIFAR-10 on ResNet-18 (period = 16).
  • ...and 6 more figures

Theorems & Definitions (30)

  • Theorem 5.1
  • Corollary 5.2
  • Remark 5.3
  • Corollary 5.4
  • Theorem 5.5
  • Corollary 5.6
  • Remark 5.7
  • Corollary 5.8
  • Lemma 6.1
  • Lemma 6.2
  • ...and 20 more