Improving Local Training in Federated Learning via Temperature Scaling

Kichang Lee; Pei Zhang; Songkuk Kim; JeongGil Ko

TL;DR

This paper tackles the slow convergence of federated learning under non-i.i.d. data by introducing FLex&Chill, which applies logit chilling through a low softmax temperature $T<1$ during local training. The authors provide a theoretical convergence analysis showing gradient amplification with $T<1$ and a bound $\mathbb{E}[F(w^{(t+1)})-F(w^*)] \le (1-\frac{\eta\mu}{T})\mathbb{E}[F(w^{(t)})-F(w^*)] + \frac{L\eta^2\sigma^2}{2T^2}$, alongside extensive empirical results across FEMNIST, CIFAR10, and CIFAR100 demonstrating up to $6\times$ faster convergence and up to $3.37\%$ improvement in inference accuracy. FLex&Chill is model- and dataset-agnostic, orthogonal to FedProx, SCAFFOLD, and FedBN, and is supported by analyses of gradient norms, CKA-based feature-space similarity, and calibration. The work shows that training-time temperature control can robustly accelerate FL in heterogeneous data environments and provides open-source tooling to enable adoption and further exploration.
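
To see what the stated bound implies over many rounds, it can be unrolled. This is only a sketch: the symbols are not defined on this page, so it assumes $\eta$ is the learning rate, $\mu$ and $L$ are the strong-convexity and smoothness constants, $\sigma^2$ bounds the gradient variance, and $0 < \frac{\eta\mu}{T} \le 1$. The shorthands $\Delta_t$ and $\rho$ are introduced here for illustration.

$$
\begin{aligned}
\Delta_t &:= \mathbb{E}[F(w^{(t)})-F(w^*)], \qquad \rho := 1-\frac{\eta\mu}{T},\\
\Delta_t &\le \rho^{t}\,\Delta_0 + \frac{L\eta^2\sigma^2}{2T^2}\sum_{k=0}^{t-1}\rho^{k}
\;\le\; \rho^{t}\,\Delta_0 + \frac{L\eta^2\sigma^2}{2T^2}\cdot\frac{T}{\eta\mu}
\;=\; \rho^{t}\,\Delta_0 + \frac{L\eta\sigma^2}{2\mu T}.
\end{aligned}
$$

Under these assumptions, lowering $T$ below 1 shrinks the contraction factor $\rho$, so the initial gap decays faster, while the residual noise term in this particular bound grows in proportion to $1/T$.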

Abstract

Federated learning is inherently hampered by data heterogeneity: non-i.i.d. training data across local clients. We propose a novel model training approach for federated learning, FLex&Chill, which exploits the Logit Chilling method. Through extensive evaluations, we demonstrate that, in the presence of non-i.i.d. data characteristics inherent in federated learning systems, this approach can expedite model convergence and improve inference accuracy. Quantitatively, in our experiments, we observe up to a 6× improvement in the convergence time of the global federated learning model and up to a 3.37% improvement in inference accuracy.
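
The idea of training with a low softmax temperature can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes the standard temperature-scaled softmax $p_i = \exp(z_i/T) / \sum_j \exp(z_j/T)$ with a cross-entropy loss, and the function names and example logits are hypothetical. For that loss, the gradient with respect to the raw logits is $(p - y)/T$, which is where the $1/T$ amplification comes from.

```python
import numpy as np


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # avoid overflow in exp
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)


def cross_entropy_logit_grad(logits, label, temperature=1.0):
    """Gradient of -log softmax(logits / T)[label] w.r.t. a 1-D logit vector."""
    probs = softmax_with_temperature(logits, temperature)
    one_hot = np.zeros_like(probs)
    one_hot[label] = 1.0
    return (probs - one_hot) / temperature  # chain rule contributes the 1/T factor


# Hypothetical logits for one misclassified sample (true class is index 1).
logits = np.array([2.0, 1.0, 0.1])
for T in (2.0, 1.0, 0.5):
    p = softmax_with_temperature(logits, T)
    g = cross_entropy_logit_grad(logits, label=1, temperature=T)
    print(f"T={T}: probs={np.round(p, 3)}, grad norm={np.linalg.norm(g):.3f}")
```

For this misclassified sample, the gradient norm grows as $T$ decreases. For confidently correct samples the $(p - y)$ term shrinks as the distribution sharpens, so the net effect depends on the sample.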

Paper Structure (21 sections, 31 equations, 24 figures, 9 tables, 1 algorithm)

INTRODUCTION
RELATED WORK
Temperature Scaling
Federated Learning
DESIGN OF FLEX&CHILL
Theoretical Analysis
Temperature vs. Learning Rate
Empirical Analysis
FLex&Chill and Logit Chilling
EVALUATION
Experiment Setup
Average Global Model Accuracy and Loss
Federated Learning Convergence
Model stability of low-temperature training during federated learning aggregation
Impact of Dataset Dispersion and Federated Learning on FLex&Chill
...and 6 more sections

Figures (24)

Figure 1: Effect of varying temperature $T$ on the output distribution of the softmax function, illustrating how lower $T$ sharpens class probabilities and higher $T$ produces smoother, more uniform distributions. Best viewed in color.
Figure 2: Distribution of gradient norm at input layer for correctly (top) / incorrectly (bottom) inferred samples with varying training temperatures.
Figure 3: Distributions depicting differences between distances to the decision boundary before and after model updates for varying training temperatures. Notice that lower temperatures show a noticeable shift in estimations' positions on the representation space, suggesting their aggressiveness in modifying the model even with a small number of training samples.
Figure 4: Example of data points in the 2D representation space with their respective classification boundaries for different federated learning clients with varying training temperatures. Best viewed in color.
Figure 5: Visualization of the distribution of training data used in FEMNIST, CIFAR10-CNN and CIFAR100-ResNet experiments. Best viewed in color.
...and 19 more figures

Theorems & Definitions (1)

proof
