DP-FedAdamW: An Efficient Optimizer for Differentially Private Federated Large Models

Jin Liu; Yinbin Miao; Ning Xi; Junkang Liu

TL;DR

This paper proposes DP-FedAdamW, the first AdamW-based optimizer for differentially private federated learning (DPFL). It restores AdamW under DP by stabilizing second-moment variance, removing DP-induced bias, and aligning local updates with the global descent direction to curb client drift.

Abstract

Balancing convergence efficiency and robustness under Differential Privacy (DP) is a central challenge in Federated Learning (FL). While AdamW accelerates training and fine-tuning of large-scale models, we find that directly applying it to Differentially Private FL (DPFL) suffers from three major issues: (i) data heterogeneity and privacy noise jointly amplify the variance of the second-moment estimator, (ii) DP perturbations bias the second-moment estimator, and (iii) DP amplifies AdamW's sensitivity to local overfitting, worsening client drift. We propose DP-FedAdamW, the first AdamW-based optimizer for DPFL. It restores AdamW under DP by stabilizing second-moment variance, removing DP-induced bias, and aligning local updates with the global descent direction to curb client drift. Theoretically, we establish an unbiased second-moment estimator and prove a linearly accelerated convergence rate without any heterogeneity assumption, while providing tighter $(\varepsilon,\delta)$-DP guarantees. Our empirical results demonstrate the effectiveness of DP-FedAdamW across language and vision Transformers and ResNet-18. On Tiny-ImageNet (Swin-Base, $\varepsilon=1$), DP-FedAdamW outperforms the state-of-the-art (SOTA) by 5.83%. The code is available in the Appendix.
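Issue (ii) can be illustrated numerically. The sketch below is not the paper's estimator (its block-wise correction is not reproduced here); it is a minimal, generic demonstration, assuming zero-mean Gaussian noise with a known per-coordinate standard deviation is added to a constant gradient. The names `second_moment_ema`, `true_grad`, and `noise_std`, and all numeric values, are hypothetical choices for illustration.

```python
import numpy as np


def second_moment_ema(grads, beta2=0.999):
    """Adam-style EMA of squared gradients with the usual 1 - beta2^t correction."""
    v = np.zeros_like(grads[0])
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g ** 2
    return v / (1.0 - beta2 ** len(grads))


rng = np.random.default_rng(0)
dim, steps = 1000, 2000
true_grad = np.full(dim, 0.1)  # hypothetical clean gradient, held constant
noise_std = 0.5                # hypothetical per-coordinate DP noise std

# Privatized gradients: clean gradient plus independent zero-mean Gaussian noise.
noisy_grads = [true_grad + rng.normal(0.0, noise_std, dim) for _ in range(steps)]

v_clean = second_moment_ema([true_grad] * steps)
v_noisy = second_moment_ema(noisy_grads)

# Since E[(g + z)^2] = g^2 + Var(z), subtracting the known noise variance
# removes the additive bias in expectation (assumed correction, for illustration).
v_debiased = v_noisy - noise_std ** 2

print("clean    :", v_clean.mean())
print("noisy    :", v_noisy.mean())
print("debiased :", v_debiased.mean())
```

In this toy setting the noisy estimate is dominated by the noise variance rather than by the squared gradient, which is why an uncorrected AdamW step would be scaled by the noise level instead of the gradient magnitude. Individual coordinates of `v_debiased` can be negative, so a practical variant would need some guard before taking a square root.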

Paper Structure (29 sections, 9 theorems, 48 equations, 5 figures, 17 tables, 2 algorithms)

Introduction
Related work
DPFL framework
Motivation and challenges
Methodology
Second-moment aggregation
Unbiased second-moment correction
Local-global alignment
Theoretical analysis
Convergence analysis
Privacy analysis
Experiments
Experimental setup
Performance on vision datasets
Performance on language datasets
...and 14 more sections

Key Result

Theorem 1

Under the smoothness and bounded-stochastic-gradient (I and II) assumptions, if we take $g^0=0$, then DP-FedAdamW converges as follows. Here $G_0:=\frac{1}{N} \sum_{i=1}^N\left\|\nabla f_i\left(\boldsymbol{\theta}^0\right)\right\|^2$, $\Delta=f\left(\boldsymbol{\theta}^0\right)-f^{\star}$, $S$ is the number of participating clients per round, $\sigma$ is the DP noise level, $\sigma_

Figures (5)

Figure 1: Illustration of DP-FedAdamW. It aggregates the mean of block-wise bias-corrected second-moment estimates and performs local–global alignment to stabilize DPFL optimization.
Figure 2: An illustration of the local update in DP-FedAdamW, which corrects client drift through global update guidance.
Figure 3: Training on CIFAR-100, Swin-Tiny, $\sigma{=}1$, $\alpha{=}0.1$. (a) Non-IID data (DPFL) causes high variance in the second-moment estimator across clients of DP-LocalAdamW. (b) DP-LocalAdamW suffers from more severe client drift than FedAvg and LocalAdamW.
Figure 4: Histogram for DP-LocalAdamW, CIFAR-10, Swin-Tiny, $\sigma{=}1$, $\alpha{=}0.1$. (a) The distribution centers of $\boldsymbol{m}^t$ are aligned with or without DP, but the variance is larger with DP. (b) The distribution of $\sqrt{\boldsymbol{v}^t}$ differs significantly, with its center shifting by approximately $\sqrt{\sigma^2 C^2/(sR)^2}$.
Figure 5: Test accuracy (%) on CIFAR-100 using ResNet-18 and Swin-Tiny under the Dirichlet $\alpha=0.6$ and $\alpha=0.1$ settings.
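
The pattern described in the Figure 4 caption is consistent with a short moment calculation. The derivation below is a sketch under stated assumptions, not the paper's proof: it assumes the privatized gradient is $\tilde{g} = g + \zeta$ per coordinate, with $\zeta$ zero-mean Gaussian of variance $\tau^2$ and independent of $g$, and it takes $\tau^2 = \sigma^2 C^2/(sR)^2$ from the caption ($s$ and $R$ are not defined in this excerpt).

```latex
% Assumption: \tilde{g} = g + \zeta, \zeta \sim \mathcal{N}(0, \tau^2), \zeta independent of g.
\begin{align*}
\mathbb{E}[\tilde{g}] &= \mathbb{E}[g],
  && \text{(first moment keeps its center)} \\
\operatorname{Var}[\tilde{g}] &= \operatorname{Var}[g] + \tau^2,
  && \text{(but its spread grows)} \\
\mathbb{E}[\tilde{g}^2] &= \mathbb{E}[g^2] + 2\,\mathbb{E}[g]\,\mathbb{E}[\zeta] + \mathbb{E}[\zeta^2]
  = \mathbb{E}[g^2] + \tau^2 .
  && \text{(second moment is shifted)}
\end{align*}
% When the noise dominates, i.e. \mathbb{E}[g^2] \ll \tau^2:
\[
\sqrt{\mathbb{E}[\tilde{g}^2]} \approx \tau = \sqrt{\sigma^2 C^2/(sR)^2}.
\]
```

Because $\boldsymbol{m}^t$ averages $\tilde{g}$ and $\boldsymbol{v}^t$ averages $\tilde{g}^2$, the first two lines correspond to panel (a) and the last two to panel (b).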

Theorems & Definitions (17)

Theorem 1: Convergence for non-convex functions
Definition 1
Definition 2
Theorem 2: Privacy guarantee
Lemma 1
Lemma 2
proof
Lemma 3
proof
Lemma 4
...and 7 more
