Taming the Instability: A Robust Second-Order Optimizer for Federated Learning over Non-IID Data

Yuanqiao Zhang, Tiantian He, Yuan Gao, Yixin Wang, Yew-Soon Ong, Maoguo Gong, A. K. Qin, Hui Li

Abstract

In this paper, we present Federated Robust Curvature Optimization (FedRCO), a novel second-order optimization framework designed to improve convergence speed and reduce communication cost in Federated Learning systems under statistical heterogeneity. Existing second-order optimization methods are often computationally expensive and numerically unstable in distributed settings. In contrast, FedRCO addresses these challenges by integrating an efficient approximate curvature optimizer with a stability mechanism that carries provable guarantees. Specifically, FedRCO incorporates three key components: (1) a Gradient Anomaly Monitor that detects and mitigates exploding gradients in real time, (2) a Fail-Safe Resilience protocol that resets optimization states upon numerical instability, and (3) a Curvature-Preserving Adaptive Aggregation strategy that safely integrates global knowledge without erasing the local curvature geometry. Theoretical analysis shows that FedRCO effectively mitigates instability and prevents unbounded updates while preserving optimization efficiency. Extensive experiments show that FedRCO is robust across diverse non-IID scenarios and attains higher accuracy and faster convergence than state-of-the-art first-order and second-order methods.
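The abstract gives no pseudocode for these components. As a minimal PyTorch sketch of the first two mechanisms, the following shows one plausible shape for a gradient-norm monitor with a fail-safe state reset; the threshold value and the reset rule are illustrative assumptions, not the paper's published protocol.

    import torch

    def monitored_step(optimizer, model, loss, norm_threshold=10.0):
        # Illustrative sketch only: the threshold and reset rule are assumed,
        # not FedRCO's exact Gradient Anomaly Monitor / Fail-Safe protocol.
        loss.backward()
        # clip_grad_norm_ returns the pre-clipping total gradient norm.
        total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(),
                                                    norm_threshold)
        if not torch.isfinite(total_norm):
            # Fail-safe: skip this step and clear the optimizer state (e.g.
            # curvature or moment estimates) so a NaN/Inf gradient cannot
            # contaminate later preconditioned updates.
            optimizer.zero_grad(set_to_none=True)
            optimizer.state.clear()
            return None
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        return float(total_norm)

Skipping the anomalous step and discarding accumulated state is a deliberately conservative choice in this sketch; a production variant might instead roll back to the last known-finite checkpoint.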

Paper Structure

This paper contains 41 sections, 9 theorems, 90 equations, 26 figures, 3 tables.

Key Result

Proposition 3.1

(Rank Deficiency) When the batch size $B \ll d$, the empirical Fisher information matrix (FIM) $\hat{\boldsymbol{F}}_c$ is rank-deficient: estimated from only $B$ gradient outer products, its rank is at most $B < d$, so sampling noise in the null space can induce unbounded updates (detailed in Appendix A.2).
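To make the failure mode concrete, here is a small NumPy sketch under the standard construction of the empirical FIM as an average of per-sample gradient outer products (the paper's exact estimator may differ): with $B \ll d$ the matrix is singular, and the usual remedy is Tikhonov damping before inversion.

    import numpy as np

    rng = np.random.default_rng(0)
    B, d = 8, 64                        # batch size far below parameter dim

    G = rng.standard_normal((B, d))     # per-sample gradients stacked as rows
    F_hat = G.T @ G / B                 # empirical FIM: average outer product
    print(np.linalg.matrix_rank(F_hat))  # at most B = 8, so F_hat is singular

    # Solving F_hat x = g directly is ill-posed: any component of g lying in
    # the (d - B)-dimensional null space makes the natural-gradient step
    # unbounded. Tikhonov damping restores invertibility and bounds the step
    # norm by ||g|| / lam.
    g = rng.standard_normal(d)
    lam = 1e-3
    step = np.linalg.solve(F_hat + lam * np.eye(d), g)
    print(np.linalg.norm(step))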

Figures (26)

  • Figure 1: Test accuracy, training accuracy, and training loss averaged across all participating clients on CIFAR-10 with $\mathrm{Dir}(\alpha = 0.1)$, 100 clients, and a participation ratio of 0.8.
  • Figure 2: (a) The proportion of total training time spent on matrix inversion and on communication. (b) Average global accuracy versus wall-clock time.
  • Figure 3: Impact of the inversion frequency ($T_{inv}$), measured in communication rounds and in wall-clock time.
  • Figure 4: Visualization of gradient stability, comparing gradient trajectories with (red) and without (blue/orange) the Gradient Anomaly Monitor.
  • Figure 5: Example client data distributions: the first panel shows a Dirichlet partition over 100 clients with $\mathrm{Dir}(\alpha = 0.1)$, and the second a pathological partition over 100 clients with 2 classes per client (a minimal partitioning sketch follows this list).
  • ...and 21 more figures
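For readers unfamiliar with the Dirichlet splits referenced in Figure 5, the following NumPy sketch shows the standard label-based Dirichlet partition used in FL benchmarks; the paper's partitioning code is not shown here, so the function below is an assumed, illustrative construction.

    import numpy as np

    def dirichlet_partition(labels, num_clients=100, alpha=0.1, seed=0):
        # For each class, draw client proportions from Dir(alpha) and split
        # that class's sample indices accordingly. Smaller alpha gives more
        # skewed (more non-IID) clients.
        rng = np.random.default_rng(seed)
        labels = np.asarray(labels)
        client_indices = [[] for _ in range(num_clients)]
        for c in np.unique(labels):
            idx = rng.permutation(np.where(labels == c)[0])
            proportions = rng.dirichlet(alpha * np.ones(num_clients))
            # Cumulative split points for this class across clients.
            splits = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
            for client, part in enumerate(np.split(idx, splits)):
                client_indices[client].extend(part.tolist())
        return client_indices

For example, dirichlet_partition(train_labels, num_clients=100, alpha=0.1) reproduces the kind of split shown in the first panel of Figure 5.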

Theorems & Definitions (16)

  • Proposition 3.1
  • Proposition 3.2
  • Proposition 3.3
  • Proposition 3.4
  • Theorem 5.1
  • Theorem 5.2
  • Theorem 5.3
  • Theorem 5.4
  • ...and 8 more