Federated Stochastic Minimax Optimization under Heavy-Tailed Noises
Xinwen Zhang, Hongchang Gao
TL;DR
This work studies federated stochastic minimax optimization under heavy-tailed gradient noise, where the objective $f(x,y)=\frac{1}{N}\sum_{n=1}^N f^{(n)}(x,y)$ is nonconvex in $x$ and satisfies the Polyak-Łojasiewicz (PL) condition in $y$. It introduces two algorithms, Fed-NSGDA-M and FedMuon-DA, which handle heavy-tailed noise without gradient clipping by combining gradient normalization, control variates, and Muon-based local updates. The authors establish a convergence rate of $O\left(\frac{1}{(TNp)^{(s-1)/(2s)}}\right)$ for both methods under mild assumptions with $s\in(1,2]$, which gives linear speedup in the number of clients $N$ and accommodates data heterogeneity without requiring an explicit bound on it. Experiments on deep AUC maximization with imbalanced federated data show that both methods are robust and outperform state-of-the-art baselines in homogeneous and heterogeneous settings. The paper presents these as the first rigorous guarantees for federated minimax optimization under heavy-tailed noise, and the results point to clipping-free optimization as a viable approach for distributed nonconvex-PL problems.
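To make the stated rate concrete, the exponent $\frac{s-1}{2s}$ can be evaluated at a few values of $s$. Reading $s$ as the order of the bounded noise moment, so that $s=2$ corresponds to bounded variance, is an assumption based on the usual heavy-tailed setting; the summary itself only states $s\in(1,2]$.

$$
\frac{s-1}{2s}\Big|_{s=2} = \frac{1}{4}, \qquad
\frac{s-1}{2s}\Big|_{s=3/2} = \frac{1}{6}, \qquad
\lim_{s\to 1^{+}} \frac{s-1}{2s} = 0 .
$$

Since $\frac{s-1}{2s} = \frac{1}{2} - \frac{1}{2s}$ is increasing in $s$, the bound is strongest at $s=2$, where it reads $O\left((TNp)^{-1/4}\right)$, and weakens as $s$ approaches 1.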
Abstract
Heavy-tailed noise has attracted growing attention in nonconvex stochastic optimization, as numerous empirical studies suggest that it is a more realistic assumption than the standard bounded-variance assumption. In this work, we investigate nonconvex-PL minimax optimization under heavy-tailed gradient noise in federated learning. We propose two novel algorithms: Fed-NSGDA-M, which integrates normalized gradients, and FedMuon-DA, which leverages the Muon optimizer for local updates. Both algorithms are designed to handle heavy-tailed noise in federated minimax optimization under a milder condition. We theoretically establish that both algorithms achieve a convergence rate of $O\left(1/(TNp)^{\frac{s-1}{2s}}\right)$. To the best of our knowledge, these are the first federated minimax optimization algorithms with rigorous theoretical guarantees under heavy-tailed noise. Extensive experiments further validate their effectiveness.
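The role of gradient normalization can be illustrated with a minimal single-client sketch. This is not the authors' Fed-NSGDA-M or FedMuon-DA: the function names, the toy objective $f(x,y)=xy-\frac{1}{2}y^2$, the Student-t noise, and all step sizes are assumptions chosen only for illustration, and the federated averaging, control variates, and Muon updates are omitted. The point it shows is that a normalized momentum direction has norm at most 1, so a single heavy-tailed gradient sample cannot move an iterate by more than the step size, which is why no clipping threshold is needed.

```python
import numpy as np


def normalized_momentum_step(momentum, grad, beta=0.9, eps=1e-12):
    """Update a momentum buffer and return it with a normalized direction.

    Hypothetical helper for illustration; beta and eps are assumed values.
    """
    momentum = beta * momentum + (1.0 - beta) * grad
    # Dividing by the norm bounds the step length regardless of noise size.
    direction = momentum / (np.linalg.norm(momentum) + eps)
    return momentum, direction


def run_toy(steps=200, lr_x=0.01, lr_y=0.05, seed=0):
    """Normalized descent in x and ascent in y on f(x, y) = x*y - 0.5*y**2.

    Gradient noise is Student-t with 1.5 degrees of freedom, which has a
    finite mean but infinite variance (an assumed stand-in for heavy tails).
    """
    rng = np.random.default_rng(seed)
    x, y = np.array([1.0]), np.array([0.0])
    mx, my = np.zeros(1), np.zeros(1)
    for _ in range(steps):
        noise = rng.standard_t(df=1.5, size=2)
        grad_x = y + noise[0]          # d/dx of x*y - 0.5*y^2, plus noise
        grad_y = (x - y) + noise[1]    # d/dy of x*y - 0.5*y^2, plus noise
        mx, dx = normalized_momentum_step(mx, grad_x)
        my, dy = normalized_momentum_step(my, grad_y)
        x = x - lr_x * dx              # descent step on x
        y = y + lr_y * dy              # ascent step on y
    return x, y


if __name__ == "__main__":
    x, y = run_toy()
    print("final x:", x, "final y:", y)
```

In this sketch each iterate moves by at most `lr_x` or `lr_y` per step, however large the sampled noise is; the trade-off is that the step length no longer shrinks near a stationary point, so in practice the step sizes would need to be tuned or decayed.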
