Taming Preconditioner Drift: Unlocking the Potential of Second-Order Optimizers for Federated Learning on Non-IID Data

Junkang Liu, Fanhua Shang, Hongying Liu, Jin Liu, Weixin An, Yuanyuan Liu

TL;DR

This work proposes FedPAC, a framework for reliable federated second-order optimization that consistently improves stability and accuracy across vision and language tasks, and provides drift-coupled non-convex convergence guarantees with linear speedup under partial participation.

Abstract

Second-order optimizers can significantly accelerate large-scale training, yet their naive federated variants are often unstable or even diverge on non-IID data. We show that a key culprit is \emph{preconditioner drift}: client-side second-order training induces heterogeneous \emph{curvature-defined geometries} (i.e., preconditioner coordinate systems), and server-side model averaging combines updates computed under incompatible metrics, corrupting the global descent direction. To address this geometric mismatch, we propose \texttt{FedPAC}, a \emph{preconditioner alignment and correction} framework for reliable federated second-order optimization. \texttt{FedPAC} explicitly decouples parameter aggregation from geometry synchronization by: (i) \textbf{Alignment} (i.e., aggregating local preconditioners into a global reference and warm-starting clients via the global preconditioner); and (ii) \textbf{Correction} (i.e., steering local preconditioned updates using a global preconditioned direction to suppress long-term drift). We provide drift-coupled non-convex convergence guarantees with linear speedup under partial participation. Empirically, \texttt{FedPAC} consistently improves stability and accuracy across vision and language tasks, achieving up to $5.8\%$ absolute accuracy gain on CIFAR-100 with ViTs. Code is available at https://anonymous.4open.science/r/FedPAC-8B24.
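
The two ingredients named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a diagonal, RMS-style preconditioner as a stand-in for a second-order optimizer, plain averaging as the aggregation rule, and a hypothetical mixing weight `beta` for the correction step. The names `local_train`, `fed_round`, `rho`, and `beta` are illustrative only.

```python
import numpy as np

def local_train(x0, v0, global_dir, grad_fn, steps=5, lr=0.1,
                beta=0.5, rho=0.9, eps=1e-8):
    """Local preconditioned training on one client (toy, diagonal preconditioner)."""
    x = x0.copy()
    v = v0.copy()  # Alignment: warm-start from the global preconditioner state
    for _ in range(steps):
        g = grad_fn(x)
        v = rho * v + (1.0 - rho) * g * g       # assumed diagonal curvature proxy
        local_dir = g / (np.sqrt(v) + eps)       # local preconditioned direction
        # Correction (assumed form): blend local and global preconditioned directions
        x = x - lr * ((1.0 - beta) * local_dir + beta * global_dir)
    return x, v

def fed_round(x_global, v_global, global_dir, grad_fns, steps=5, lr=0.1, beta=0.5):
    """One round: parameters and preconditioners are aggregated separately."""
    outs = [local_train(x_global, v_global, global_dir, fn, steps, lr, beta)
            for fn in grad_fns]
    x_new = np.mean([x for x, _ in outs], axis=0)  # parameter aggregation
    v_new = np.mean([v for _, v in outs], axis=0)  # geometry synchronization
    # Average direction applied per local step, reused as next round's global direction
    dir_new = (x_global - x_new) / (lr * steps)
    return x_new, v_new, dir_new

# Two clients with quadratic losses 0.5 * ||x - c||^2 and different minimizers c
centers = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
grad_fns = [lambda x, c=c: x - c for c in centers]
x0, v0, d0 = np.zeros(2), np.zeros(2), np.zeros(2)
x1, v1, d1 = fed_round(x0, v0, d0, grad_fns)
```

The point of the sketch is the separation of concerns: `x_new` and `v_new` are synchronized independently, and every client starts the next round from the same preconditioner state rather than from its own drifted one.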

Paper Structure (94 sections, 10 theorems, 115 equations, 5 figures, 14 tables, 9 algorithms)


Key Result

Theorem 5.6

Under Assumptions smoothness, bounded_stochastic_gradient_I, bounded_heterogeneity, precond_boundedness, precond_lipschitz_state, and $\exists\,G^2>0\ \text{s.t.}\ \sup_{r,i,k}\mathbb{E}\|g_i^{r,k}\|^2\le G^2$, if we take $g^0=0$ and choose the stepsize $\eta$ such that $\eta \le \min\{\dots\}$ […], where $\Delta=f(\boldsymbol{x}^0)-f^\star$ and $\bar{\Delta}_D=\frac{1}{R}\sum_{r=0}^{R-1}\Delta_D$ […]

Figures (5)

  • Figure 1: (a) In non-IID FL, first-order methods converge slowly, inducing little client drift. (b) Second-order methods converge faster locally and thus drift toward local optima, causing the aggregated global model to deviate from the global optimum. (c) FedPAC corrects local second-order updates, yielding faster convergence and a global model closer to the global optimum. (d–f) FedPAC accelerates Sophia, Muon, and SOAP when training ResNet-18 on CIFAR-100. The x-axis denotes the number of communication rounds, and the y-axis denotes test accuracy.
  • Figure 2: The x-axis denotes communication rounds, and the y-axis denotes test accuracy. (a, c): In FL on IID data, second-order optimizers converge significantly faster than SGD and AdamW when training ResNet and Transformer models. (b, d): On non-IID data, however, second-order optimizers (e.g., Local Muon) converge much more slowly and can even underperform first-order methods such as Local SGD.
  • Figure 3: Performance and preconditioner drift of Local SOAP and FedPAC_SOAP. For SOAP, we compute the preconditioner drift layer-wise by measuring the spectral norm of the difference between each client’s left/right preconditioner and the aggregated global preconditioner. FedPAC_SOAP substantially reduces the preconditioner drift of Local SOAP, as shown in (a, b), and accelerates convergence, reaching 40% test accuracy in fewer rounds, as shown in (c) and (d).
  • Figure 4: An illustration of preconditioner drift and of how local-global alignment corrects client drift.
  • Figure 5: Test accuracy versus communication rounds. The x-axis denotes communication rounds and the y-axis denotes test accuracy. Panels (a–d) correspond to training with ResNet-18, while panels (e–h) correspond to training with ViT-Tiny.
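
The drift measurement described in the Figure 3 caption can be sketched as follows, for one layer and one side (left or right) of the preconditioner. The caption does not state the aggregation rule, so this sketch assumes the global preconditioner is a (possibly weighted) average of the client preconditioners; the function name `preconditioner_drift` and its `weights` argument are hypothetical.

```python
import numpy as np

def preconditioner_drift(client_preconds, weights=None):
    """Spectral-norm distance of each client's preconditioner to the aggregate.

    client_preconds: list of (d, d) matrices, one per client, for a single layer.
    weights: optional per-client aggregation weights (assumed; uniform if None).
    """
    P = np.stack(client_preconds)                       # shape (num_clients, d, d)
    global_P = np.average(P, axis=0, weights=weights)   # assumed aggregation rule
    # ord=2 on a 2-D array is the spectral norm (largest singular value)
    drifts = np.array([np.linalg.norm(Pi - global_P, ord=2) for Pi in P])
    return global_P, drifts

# Two clients whose preconditioners disagree along the first coordinate only
P1 = np.diag([1.0, 1.0])
P2 = np.diag([3.0, 1.0])
global_P, drifts = preconditioner_drift([P1, P2])
```

For a full model, the same function would be applied per layer and separately to the left and right factors, giving one drift value per client, layer, and side.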

Theorems & Definitions (15)

  • Theorem 5.6: Convergence of FedSOA for non-convex functions
  • Theorem 5.7: Convergence of FedPAC for non-convex functions
  • Lemma 2.1: Drift-induced preconditioned disagreement
  • proof
  • Lemma 3.1: Drift-induced preconditioned disagreement
  • Lemma 3.2: Descent with inexact direction
  • proof
  • Lemma 3.3: Recursive estimator error (no explicit $\sigma_g^2$)
  • proof
  • Theorem 3.4: Non-convex convergence without explicit heterogeneity
  • ...and 5 more