NebulaFL: Effective Asynchronous Federated Learning for JointCloud Computing

Fei Gao, Ming Hu, Zhiyu Xie, Peichang Shi, Xiaofei Xie, Guodong Yi, Huaimin Wang

TL;DR

NebulaFL addresses the challenges of Federated Learning as a Service in JointCloud Computing by introducing asynchronous intra-DC training with multiple planet models and a stellar model for knowledge sharing, paired with inter-DC model rotation to reduce cross-cloud communication. It couples a reward-guided container selection and resource scheduling mechanism with a dual-criteria optimization that balances training time and rent cost, using both performance rewards and curiosity bonuses. The approach is supported by a convergence analysis akin to FedAvg under standard assumptions and extensive experiments showing up to 5.71% accuracy gains, up to 50% communication overhead reduction, and up to 61.94% cost savings at target accuracy. Collectively, NebulaFL demonstrates scalable, privacy-preserving collaboration across multiple data centers with TEEs, delivering practical improvements for JointCloud FL deployment.
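The combination of performance rewards and curiosity bonuses can be illustrated with a minimal sketch. The count-based bonus below is an assumed, UCB-style form chosen for illustration; the summary does not give the paper's actual reward definition, and the names `selection_scores`, `pick_top_k`, and `bonus_weight` are hypothetical.

```python
import math


def selection_scores(mean_reward, times_selected, total_rounds, bonus_weight=1.0):
    """Score each candidate as its observed reward plus a curiosity bonus.

    mean_reward:    dict mapping candidate id -> average observed reward
    times_selected: dict mapping candidate id -> how often it was picked so far
    The bonus form is an assumption for illustration, not the paper's formula.
    """
    scores = {}
    for cid, reward in mean_reward.items():
        n = times_selected.get(cid, 0)
        # Rarely selected candidates receive a larger bonus, which encourages exploration.
        curiosity = bonus_weight * math.sqrt(math.log(total_rounds + 1) / (n + 1))
        scores[cid] = reward + curiosity
    return scores


def pick_top_k(scores, k):
    """Return the k candidate ids with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    rewards = {"a": 0.5, "b": 0.5, "c": 0.1}
    counts = {"a": 10, "b": 0, "c": 0}
    print(pick_top_k(selection_scores(rewards, counts, total_rounds=10), k=2))
```

With a positive `bonus_weight`, a candidate that has never been selected can outrank one with a higher observed reward; setting `bonus_weight=0` reduces the rule to purely greedy selection.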

Abstract

With advancements in AI infrastructure and Trusted Execution Environment (TEE) technology, Federated Learning as a Service (FLaaS) through JointCloud Computing (JCC) promises to overcome the resource constraints caused by heterogeneous edge devices in the traditional Federated Learning (FL) paradigm. Specifically, with the protection of TEEs, data owners can train models efficiently using high-performance AI services in the cloud. By providing additional FL services, cloud service providers can enable collaborative learning among data owners. However, FLaaS still faces three challenges: i) low training performance caused by heterogeneous data among data owners, ii) high communication overhead among different clouds (i.e., data centers), and iii) a lack of efficient resource scheduling strategies to balance training time and cost. To address these challenges, this paper presents a novel asynchronous FL approach named NebulaFL for collaborative model training among multiple clouds. To address data heterogeneity, NebulaFL adopts a version control-based asynchronous FL training scheme in each data center to balance training time among data owners. To reduce communication overhead, NebulaFL adopts a decentralized model rotation mechanism to achieve effective knowledge sharing among data centers. To balance training time and cost, NebulaFL integrates a reward-guided strategy for data owner selection and resource scheduling. The experimental results demonstrate that, compared to state-of-the-art FL methods, NebulaFL can achieve up to 5.71% accuracy improvement. In addition, NebulaFL can reduce communication overhead by up to 50% and costs by up to 61.94% at a target accuracy.
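To make the idea of decentralized model rotation concrete, here is a minimal sketch in which each data center forwards its current model to one peer instead of every center uploading to a central aggregator. The ring order and the `rotate_models` helper are assumptions for illustration; the abstract does not specify the rotation order or which models are exchanged.

```python
def rotate_models(models_by_dc, shift=1):
    """Pass each data center's model to the center `shift` positions ahead in a ring.

    models_by_dc: dict mapping data-center id -> model (any object).
    The ring topology is an assumed choice, not taken from the paper.
    """
    dcs = list(models_by_dc)
    n = len(dcs)
    # The data center at position i sends its model to position (i + shift) mod n.
    return {dcs[(i + shift) % n]: models_by_dc[dcs[i]] for i in range(n)}


if __name__ == "__main__":
    models = {"dc0": "m0", "dc1": "m1", "dc2": "m2"}
    for round_idx in range(3):
        models = rotate_models(models)
        print(round_idx, models)
```

In this sketch every rotation costs one model transfer per data center, and after as many rotations as there are data centers each model has visited every center once.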

Paper Structure

This paper contains 25 sections, 13 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Framework and workflow of our NebulaFL approach
  • Figure 2: Comparison of communication overhead with different configs
  • Figure 3: Learning curves for different numbers of centers
  • Figure 4: Learning curves for different numbers of centers
  • Figure 5: Ablation study about rotation strategy