Federated Learning Framework for Scalable AI in Heterogeneous HPC and Cloud Environments

Sangam Ghimire, Paribartan Timalsina, Nirjal Bhurtel, Bishal Neupane, Bigyan Byanju Shrestha, Subarna Bhattarai, Prajwal Gaire, Jessica Thapa, Sudan Jha

TL;DR

This paper tackles the challenge of building privacy-preserving AI at scale on heterogeneous HPC and cloud infrastructures. It presents a modular federated learning framework with a Central Orchestrator, flexible Communication Layer, and Scheduler Adapter, complemented by heterogeneity-aware optimizations such as adaptive client selection, straggler mitigation, and communication-efficient updates. The approach supports non-IID data through robust aggregation (FedAvg, FedProx, weighted schemes) and demonstrates near-linear scalability, strong fault tolerance, and maintained accuracy on a hybrid HPC–cloud testbed. The work shows that federated learning can be practical and effective across mixed infrastructures, enabling scalable AI with data locality and privacy guarantees in real-world deployments.
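As a rough illustration of the aggregation step named above, the sketch below shows sample-weighted averaging in the style of FedAvg. It is a minimal sketch, not the paper's implementation: the function name `fedavg`, the dictionary-of-lists model layout, and the per-client sample counts are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical model layout: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def fedavg(updates: List[Tuple[Weights, int]]) -> Weights:
    """Average client models, weighting each by its local sample count."""
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("need at least one client with a positive sample count")

    # Start from zeros shaped like the first client's parameters.
    first = updates[0][0]
    aggregated = {name: [0.0] * len(vals) for name, vals in first.items()}

    for weights, n in updates:
        share = n / total  # this client's fraction of all training samples
        for name, vals in weights.items():
            for i, v in enumerate(vals):
                aggregated[name][i] += share * v
    return aggregated


# Two illustrative clients: the second holds three times as much data.
clients = [({"w": [1.0, 2.0]}, 1), ({"w": [3.0, 4.0]}, 3)]
global_model = fedavg(clients)
```

FedProx keeps this server-side averaging and instead changes each client's local objective by adding a proximal term that penalizes drift from the current global model, which is why the two methods are usually compared under non-IID data.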

Abstract

As demand grows for scalable and privacy-aware artificial intelligence (AI) systems, Federated Learning (FL) has emerged as a promising solution, allowing decentralized model training without moving raw data. At the same time, the combination of high-performance computing (HPC) and cloud infrastructure offers vast computing power but introduces new complexities, especially when dealing with heterogeneous hardware, communication limits, and non-uniform data. In this work, we present a federated learning framework built to run efficiently across mixed HPC and cloud environments. Our system addresses key challenges such as system heterogeneity, communication overhead, and resource scheduling, while maintaining model accuracy and data privacy. Through experiments on a hybrid testbed, we demonstrate strong performance in terms of scalability, fault tolerance, and convergence, even when client data are not independent and identically distributed (non-IID) and hardware varies. These results highlight the potential of federated learning as a practical approach to building scalable AI systems in modern, distributed computing settings.

Paper Structure

This paper contains 19 sections, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: Architecture of the proposed federated learning framework.
  • Figure 2: Accuracy comparison of FedAvg and FedProx across different datasets under non-IID settings.