FedNAR: Federated Optimization with Normalized Annealing Regularization

Junbo Li; Ang Li; Chong Tian; Qirong Ho; Eric P. Xing; Hongyi Wang

TL;DR

The paper addresses the sensitivity of federated optimization to weight decay and the resulting divergence between local and global objectives. It introduces FedNAR, a plug-in that normalizes and anneals regularization through co-clipping of the gradient and weight decay, with adaptive learning-rate and weight-decay schedules. A theoretical convergence analysis under standard FL assumptions shows that FedNAR achieves an $O(1/T)$ convergence term plus bounded weight-decay and heterogeneity errors, and experiments on vision and language tasks demonstrate accelerated convergence and higher accuracy when FedNAR is applied to existing FL baselines. The approach is simple to implement, robust to hyperparameters, and adaptable to multiple backbone FL algorithms, offering practical improvements for real-world federated learning systems.

Abstract

Weight decay is a standard technique to improve generalization performance in modern deep neural network optimization, and is also widely adopted in federated learning (FL) to prevent overfitting in local clients. In this paper, we first explore the choices of weight decay and find that the weight decay value appreciably influences the convergence of existing FL algorithms. While preventing overfitting is crucial, weight decay can introduce an optimization goal that differs from the global objective, an effect that is further amplified in FL by multiple local updates and heterogeneous data distributions. To address this challenge, we develop Federated optimization with Normalized Annealing Regularization (FedNAR), a simple yet effective and versatile algorithmic plug-in that can be seamlessly integrated into any existing FL algorithm. Essentially, we regulate the magnitude of each update by performing co-clipping of the gradient and weight decay. We provide a comprehensive theoretical analysis of FedNAR's convergence rate and conduct extensive experiments on both vision and language datasets with different backbone federated optimization algorithms. Our experimental results consistently demonstrate that incorporating FedNAR into existing FL algorithms leads to accelerated convergence and higher model accuracy. Moreover, FedNAR is resilient to various hyperparameter configurations. Specifically, FedNAR can self-adjust the weight decay when the initial specification is not optimal, whereas the accuracy of traditional FL algorithms would markedly decline. Our code is released at https://github.com/ljb121002/fednar.
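
The co-clipping idea in the abstract can be illustrated with a minimal sketch. It assumes a plain SGD-style local step in which the gradient and the weight-decay term are summed and then rescaled together whenever their joint norm exceeds a threshold. The function name `co_clipped_step` and the parameters `lr`, `weight_decay`, and `max_norm` are hypothetical choices for illustration and are not taken from the paper or its released code.

```python
import math

def co_clipped_step(x, grad, lr, weight_decay, max_norm):
    """One local step with gradient and weight decay clipped together.

    Assumed form (an illustrative sketch, not the paper's exact algorithm):
        update = grad + weight_decay * x
        x_new  = x - lr * update / max(1, ||update|| / max_norm)
    """
    # Combine the gradient and the weight-decay term before clipping.
    update = [g + weight_decay * xi for g, xi in zip(grad, x)]
    norm = math.sqrt(sum(u * u for u in update))
    # The scale is 1 when the joint norm is within the threshold.
    scale = 1.0 / max(1.0, norm / max_norm)
    new_x = [xi - lr * scale * u for xi, u in zip(x, update)]
    return new_x, norm, scale < 1.0

# Usage with hypothetical values: a zero gradient isolates the weight-decay term.
x = [3.0, 4.0]
grad = [0.0, 0.0]
clipped_x, joint_norm, was_clipped = co_clipped_step(
    x, grad, lr=0.1, weight_decay=1.0, max_norm=1.0
)
free_x, _, free_clipped = co_clipped_step(
    x, grad, lr=0.1, weight_decay=1.0, max_norm=10.0
)
```

In this sketch, clipping only the gradient would leave the weight-decay term unbounded, whereas rescaling the sum bounds the size of the whole update, which is the property the abstract attributes to co-clipping.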

Paper Structure (35 sections, 9 theorems, 55 equations, 8 figures, 2 tables, 2 algorithms)

Introduction
Our contributions.
Related work
Preliminaries: federated optimization formulation
An empirical study of weight decay
FedNAR and its convergence analysis
A general federated optimization framework
Convergence analysis and FedNAR
Discussions.
Choice of $\lambda$ and $\mu$.
Understanding FedNAR
Normalized annealing regularization.
Flexibility of FedNAR.
Comparison with gradient clipping.
Implementation of FedNAR.
...and 20 more sections

Key Result

Lemma 1

For round $t\geq 1$, the global update is equivalent to ... where ...

Figures (8)

Figure 1: Accuracy of three FL algorithms (i.e., FedAvg, FedProx, and SCAFFOLD) after 1000 rounds using different weight decay coefficients. See the section "An empirical study of weight decay" for details.
Figure 2: Influence of weight decay for different settings. We train each algorithm (i.e., FedAvg, FedProx, SCAFFOLD, FedExp, FedAdam, FedAvgm) over 1000 rounds with $\tau$ local steps per round, a local learning rate of 0.01, and a global learning rate of 1.0. We apply a decay for the local learning rate of 0.998 per round and a gradient clipping of max norm 10, following Jhunjhunwala et al. (2023). Given that multiple local steps and imbalanced data distribution are two distinguishing features of FL, we utilize various pairs of $(\alpha, \tau)$ to observe their influence on the results. The baseline setting is chosen to be $(\alpha,\tau)=(0.3,20)$, resulting in a highly imbalanced data distribution. The second configuration reduces the number of local updates to $(\alpha,\tau)=(0.3,5)$. The third configuration employs a balanced data distribution with $(\alpha,\tau)=(10,20)$.
Figure 3: Test accuracy curves for FedAvg, FedProx, SCAFFOLD and their FedNAR variants for $\alpha=0.3$. Each training run is repeated with 3 random seeds.
Figure 4: The self-adjusting capability of FedNAR. WD denotes the initial weight decay value applied in the first round. Note that in each round $t$, the baseline methods and FedNAR use the same weight decay value. The difference is that FedNAR applies co-clipping to both the weight decay and the gradient. We compare FedAvg, FedProx, and SCAFFOLD with WD values of 0.1 and 0.01. A WD value of 0.1 is suboptimal and degrades the performance of the baseline methods. FedNAR initially follows the same trend but recovers quickly in later stages and outperforms the baselines, even with the more favorable initial WD value of 0.01.
Figure 5: Frequency and strength of clipping. In every round, there are 20 clients in total, and each client carries out 20 updates, giving 400 updates per round. In each round, we count how many of these 400 updates are clipped and compute the average norm of the clipped updates. We run experiments for different $\alpha$ values, and for every algorithm we report results for three different seeds.
...and 3 more figures
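
The per-round bookkeeping described in the Figure 5 caption can be sketched as follows. This is an illustration under stated assumptions: the update norms are measured before clipping, the helper name `clipping_stats` is hypothetical, and the norms and the threshold in the usage example are made-up values rather than results from the paper.

```python
def clipping_stats(update_norms, max_norm):
    """Count clipped updates in one round and average their norms.

    update_norms: norms of all local updates in the round (assumed to be
                  measured before clipping), e.g. 20 clients x 20 updates.
    max_norm:     clipping threshold (hypothetical value in the usage below).
    """
    clipped = [n for n in update_norms if n > max_norm]
    count = len(clipped)
    # Report 0.0 when nothing was clipped to avoid dividing by zero.
    average = sum(clipped) / count if count else 0.0
    return count, average

# Usage with made-up norms for one round of 400 updates.
norms = [5.0] * 390 + [12.0] * 6 + [18.0] * 4
num_clipped, avg_clipped_norm = clipping_stats(norms, max_norm=10.0)
```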

Theorems & Definitions (15)

Lemma 1
Theorem 1
Theorem 2
Lemma 1
proof
Theorem 1
proof
Lemma 2
proof
Theorem 2
...and 5 more
