Federated Stochastic Minimax Optimization under Heavy-Tailed Noises

Xinwen Zhang, Hongchang Gao

TL;DR

This work tackles federated stochastic minimax optimization under heavy-tailed gradient noise, where the objective $f(x,y)=\frac{1}{N}\sum_{n=1}^N f^{(n)}(x,y)$ is nonconvex in $x$ and satisfies the Polyak-Łojasiewicz (PL) condition in $y$. It introduces two algorithms, Fed-NSGDA-M and FedMuon-DA, that handle heavy-tailed noise without gradient clipping by combining gradient normalization, control variates, and Muon-based local updates. The authors establish a convergence rate of $O\left(\frac{1}{(TNp)^{(s-1)/(2s)}}\right)$ for both methods under mild assumptions, where $s\in(1,2]$ is the order of the noise moment assumed to be bounded. The rate achieves linear speedup in the number of clients $N$ and requires no explicit bound on data heterogeneity. Experiments on deep AUC maximization with imbalanced federated data show robustness and better performance than state-of-the-art baselines in both homogeneous and heterogeneous settings. According to the authors, these are the first rigorous guarantees for federated minimax optimization under heavy-tailed noise, and they point to clipping-free optimization strategies for distributed nonconvex-PL problems.
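
For reference, the setting summarized above can be written out as follows. This is a sketch using the standard textbook forms of the PL condition and of the bounded $s$-th moment assumption; the symbols $\mu$, $\sigma$, $\xi$, and $F^{(n)}$ are notation assumed here for illustration, and the paper's exact statements may differ.

$$\min_{x}\max_{y}\; f(x,y)=\frac{1}{N}\sum_{n=1}^{N} f^{(n)}(x,y)$$

$$\left\|\nabla_y f(x,y)\right\|^{2}\ge 2\mu\left(\max_{y'} f(x,y')-f(x,y)\right),\quad \mu>0$$

$$\mathbb{E}_{\xi}\left\|\nabla F^{(n)}(x,y;\xi)-\nabla f^{(n)}(x,y)\right\|^{s}\le\sigma^{s},\quad s\in(1,2]$$

With $s=2$ the last condition reduces to bounded variance and the rate exponent $(s-1)/(2s)$ equals $1/4$, giving $O\left((TNp)^{-1/4}\right)$. With $s=1.5$ the exponent is $1/6$, and it tends to $0$ as $s\to 1$, so heavier tails give a slower rate.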

Abstract

Heavy-tailed noise has attracted growing attention in nonconvex stochastic optimization, as numerous empirical studies suggest it is a more realistic assumption than the standard bounded-variance assumption. In this work, we investigate nonconvex-PL minimax optimization under heavy-tailed gradient noise in federated learning. We propose two novel algorithms: Fed-NSGDA-M, which integrates normalized gradients, and FedMuon-DA, which leverages the Muon optimizer for local updates. Both algorithms are designed to address heavy-tailed noise in federated minimax optimization under a milder condition. We theoretically establish that both algorithms achieve a convergence rate of $O\left(1/(TNp)^{(s-1)/(2s)}\right)$. To the best of our knowledge, these are the first federated minimax optimization algorithms with rigorous theoretical guarantees under heavy-tailed noise. Extensive experiments further validate their effectiveness.
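
The two local-update ideas named in the abstract, normalized gradients and Muon-style updates, can be illustrated with a minimal sketch. It is not the authors' Fed-NSGDA-M or FedMuon-DA: it omits the federated communication, the control variates, and the descent-ascent structure. The function names, the exponential-moving-average momentum, and the use of an exact SVD in place of Muon's Newton-Schulz iteration are all assumptions made for illustration.

```python
import numpy as np


def normalized_momentum_direction(momentum, grad, beta=0.9, eps=1e-12):
    """Blend a new stochastic gradient into the momentum buffer, then
    return the buffer together with its unit-norm direction."""
    momentum = beta * momentum + (1.0 - beta) * grad
    direction = momentum / (np.linalg.norm(momentum) + eps)
    return momentum, direction


def orthogonalized_direction(momentum_matrix):
    """Return U @ Vt from the thin SVD of a 2-D momentum buffer.
    Muon approximates this factor with a Newton-Schulz iteration;
    the exact SVD is used here only for clarity (an assumption)."""
    u, _, vt = np.linalg.svd(momentum_matrix, full_matrices=False)
    return u @ vt


rng = np.random.default_rng(0)

# Heavy-tailed noise: Student-t with 1.5 degrees of freedom has infinite variance.
true_grad = np.array([3.0, -4.0])
noisy_grad = true_grad + rng.standard_t(df=1.5, size=2)
_, step_dir = normalized_momentum_direction(np.zeros(2), noisy_grad)
print("vector step norm:", np.linalg.norm(step_dir))

# Matrix-shaped parameter: the orthogonalized step has all singular values near 1.
noisy_matrix = rng.standard_normal((4, 3)) + rng.standard_t(df=1.5, size=(4, 3))
ortho = orthogonalized_direction(noisy_matrix)
print("singular values:", np.linalg.svd(ortho, compute_uv=False))
```

In both cases the step magnitude is fixed by construction: unit norm for the vector, and singular values equal to one for a full-rank matrix. A single extreme noise sample can change the direction of the update but not its size, which is the intuition behind dropping gradient clipping.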

Paper Structure

This paper contains 25 sections, 17 theorems, 87 equations, 3 figures.

Key Result

Theorem 1

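Given the assumptions from smoothness through heavy-tailed variance (source labels `assumption:smooth` to `assumption:ht_variance`), by setting the hyperparameters as specified in the paper, we obtain the convergence rate stated in the abstract, $O\left(1/(TNp)^{(s-1)/(2s)}\right)$.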

Figures (3)

  • Figure 1: Testing AUC curves over epochs, $p = 4$, imbalance ratio $r = 0.1$, i.i.d. scenario.
  • Figure 2: Testing AUC curves over epochs, $p = 16$, imbalance ratio $r = 0.1$, i.i.d. scenario.
  • Figure 3: Testing AUC curves over epochs, $p = 4$, non-i.i.d. scenario.

Theorems & Definitions (32)

  • Theorem 1
  • Remark 4.1
  • Remark 4.2
  • Remark 4.3
  • Lemma 4.1
  • Theorem 2
  • Remark 4.4
  • Lemma 4.2
  • Lemma A.1
  • Lemma A.2
  • ...and 22 more