CyclicFL: A Cyclic Model Pre-Training Approach to Efficient Federated Learning

Pengyu Zhang; Yingbo Zhou; Ming Hu; Xian Wei; Mingsong Chen

TL;DR

CyclicFL tackles slow convergence and degraded accuracy in federated learning under non-IID data by introducing cyclic pre-training on selected AIoT devices to derive a strong initial global model without exposing local data. It formalizes a two-phase workflow where cyclic pre-training optimizes a task-consistent objective $\mathcal{F}(\mathbf{w})$ starting from a random $\mathbf{w}_{rg}$ to obtain $\mathbf{w}_{wg}$, which then seeds standard FL training. The paper proves data-consistency enhances Lipschitzness of the loss and provides phase-wise convergence rates under $L$-smoothness across strongly convex, convex, and non-convex regimes, showing accelerated convergence. Empirical results across FEMNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100 demonstrate up to $14.11$ percentage-point gains in maximum accuracy and substantially faster convergence, while maintaining privacy and compatibility with baseline FL methods. Overall, CyclicFL offers a practical, privacy-preserving path to faster and more accurate FL on security-critical AIoT deployments.
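For reference, the regularity notions named above are usually stated as follows. These are the textbook forms, given here as an assumption; the paper's own definitions may differ in constants or norms, and $G$, $L$, and $\mathbf{w}'$ are notation introduced only for this note.

$$
\begin{aligned}
\text{($G$-Lipschitz loss)} \quad & |\mathcal{F}(\mathbf{w}) - \mathcal{F}(\mathbf{w}')| \le G \,\|\mathbf{w} - \mathbf{w}'\|, \\
\text{($L$-smooth loss)} \quad & \|\nabla \mathcal{F}(\mathbf{w}) - \nabla \mathcal{F}(\mathbf{w}')\| \le L \,\|\mathbf{w} - \mathbf{w}'\|, \\
\text{(consequence of $L$-smoothness)} \quad & \mathcal{F}(\mathbf{w}') \le \mathcal{F}(\mathbf{w}) + \langle \nabla \mathcal{F}(\mathbf{w}),\, \mathbf{w}' - \mathbf{w} \rangle + \frac{L}{2}\,\|\mathbf{w}' - \mathbf{w}\|^{2}.
\end{aligned}
$$

A smaller Lipschitz constant corresponds to a flatter loss landscape around the starting point, which is the sense in which a well-initialized $\mathbf{w}_{wg}$ is expected to help the subsequent SGD steps.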

Abstract

Federated learning (FL) has been proposed to enable distributed learning on Artificial Intelligence Internet of Things (AIoT) devices with guarantees of high-level data privacy. Since random initial models in FL can easily result in unregulated Stochastic Gradient Descent (SGD) processes, existing FL methods suffer from both slow convergence and poor accuracy, especially in non-IID scenarios. To address this problem, we propose a novel method named CyclicFL, which can quickly derive effective initial models to guide the SGD processes, thus improving the overall FL training performance. We formally analyze the significance of data consistency between the pre-training and training stages of CyclicFL, showing the limited Lipschitzness of the loss for the models pre-trained by CyclicFL. Moreover, we systematically prove that our method achieves faster convergence under various convexity assumptions. Unlike traditional centralized pre-training methods that require public proxy data, CyclicFL pre-trains initial models cyclically on selected AIoT devices without exposing their local data. Therefore, it can be easily integrated into any security-critical FL method. Comprehensive experimental results show that CyclicFL can not only improve the maximum classification accuracy by up to $14.11\%$ but also significantly accelerate the overall FL training process.
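The two-phase workflow described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the least-squares model, the synthetic non-IID split, the random client selection, and the names `local_train`, `cyclic_pretrain`, and `fedavg` are all hypothetical stand-ins. Only the symbols $\mathbf{w}_{rg}$ (random initial model) and $\mathbf{w}_{wg}$ (pre-trained model) come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_clients, n_per_client = 5, 8, 40
w_true = rng.normal(size=d)  # ground-truth weights of the toy task

# Hypothetical non-IID split: each client's features are shifted
# along a different coordinate. Data never leaves the client tuple.
clients = []
for k in range(n_clients):
    X = rng.normal(size=(n_per_client, d))
    X[:, k % d] += 1.0
    clients.append((X, X @ w_true))


def local_train(w, X, y, lr=0.1, epochs=5):
    """A few full-batch gradient steps on a least-squares loss
    (a stand-in for local SGD on a device)."""
    w = w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w


def cyclic_pretrain(w_rg, clients, selected, cycles=3):
    """Phase 1 (assumed form): one model is handed from one selected
    client to the next; only the weights move between devices."""
    w = w_rg.copy()
    for _ in range(cycles):
        for k in selected:
            w = local_train(w, *clients[k])
    return w


def fedavg(w_init, clients, rounds=10):
    """Phase 2: plain FedAvg seeded with the pre-trained model."""
    w = w_init.copy()
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    for _ in range(rounds):
        local_models = np.stack([local_train(w, X, y) for X, y in clients])
        w = sizes @ local_models / sizes.sum()  # size-weighted average
    return w


w_rg = rng.normal(size=d)                                 # random start
selected = rng.choice(n_clients, size=3, replace=False)   # assumed selection rule
w_wg = cyclic_pretrain(w_rg, clients, selected)           # pre-trained model
w_final = fedavg(w_wg, clients)                           # standard FL from w_wg

print("distance to w_true at w_rg:   ", np.linalg.norm(w_rg - w_true))
print("distance to w_true at w_wg:   ", np.linalg.norm(w_wg - w_true))
print("distance to w_true at w_final:", np.linalg.norm(w_final - w_true))
```

Because phase 2 is an unmodified FedAvg loop that only receives a different starting point, the same pre-training step could in principle be placed in front of other FL aggregation schemes, which matches the compatibility claim in the abstract.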

Paper Structure (11 sections, 2 theorems, 15 equations, 9 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Methodology
Theoretical Analysis
Data consistency on Lipschitzness of loss
Convergence analysis
Performance Evaluation
Experimental Settings
Experimental Results
Ablation Study
Conclusion

Key Result

Lemma 1

(The impact of transferred features on the Lipschitzness of the loss) Let $Q$ represent the target dataset, $P$ represent the pre-training dataset, and $\operatorname{poly}$ represent a polynomial. For a two-layer network with a large number $m$ of hidden neurons, if $m \geq \operatorname{poly}(n_{\ldots}\,\ldots)$ […] where $0 < \gamma \leq 1$ controls the magnitude of the initial network parameters $\mathbf{w}(0)$. […]

Figures (9)

Figure 1: The loss landscape of a LeNet-5 initialized randomly. The training process is slowed due to the sharp landscape.
Figure 2: Workflow of our pre-training approach.
Figure 3: Test accuracy for CIFAR-100-Fine.
Figure 4: Comparison of test accuracy for CIFAR-10, $\beta=0.5$.
Figure 5: Loss landscape visualizations of a LeNet-5 model for CIFAR-10. $\beta$ is set to $0.5$.
...and 4 more figures

Theorems & Definitions (5)

Definition 1
Lemma 1
Remark 1
Definition 2
Corollary 1
