Scalable Cross-Facility Federated Learning for Scientific Foundation Models on Multiple Supercomputers

Yijiang Li; Zilinghan Li; Kyle Chard; Ian Foster; Todd Munson; Ravi Madduri; Kibaek Kim

Abstract

Artificial intelligence for scientific applications increasingly requires training large models on data that cannot be centralized due to privacy constraints, data sovereignty, or the sheer volume of data generated. Federated learning (FL) addresses this by enabling collaborative training without centralizing raw data, but scientific applications demand model scales that require extensive computing resources, typically offered at High Performance Computing (HPC) facilities. Deploying FL experiments across HPC facilities introduces challenges beyond those of cloud or enterprise settings. We present a comprehensive cross-facility FL framework for heterogeneous HPC environments, built on the Advanced Privacy-Preserving Federated Learning (APPFL) framework with Globus Compute and Globus Transfer orchestration, and evaluate it across four U.S. Department of Energy (DOE) leadership-class supercomputers. We demonstrate that FL experiments across HPC facilities are practically achievable, characterize the key sources of heterogeneity that affect training performance, and show that algorithmic choices matter significantly under realistic HPC scheduling conditions. We validate the scientific applicability by fine-tuning a large language model on a chemistry instruction dataset, and identify scheduler-aware algorithm design as a critical open challenge for future deployments.

Paper Structure (17 sections, 6 figures, 6 tables)

Introduction
Results
Use Case and Data Distribution
FL Configurations
Scalability and Throughput
Training Performance
Large-Scale Co-Scheduled Deployment
Small-Scale Algorithm Comparison
Globus Transfer Communication
Discussion
Related Work and Context
Practical FL Algorithm Design
Lessons Learned
Methods
APPFL and DOE Supercomputers
...and 2 more sections

Figures (6)

Figure 1: Throughput scaling with fixed micro-batch size per GPU. The left panel shows throughput (samples per second) as a function of node count, while the right panel shows the same results as a function of total GPU count. The consistent scaling patterns across both panels confirm that memory capacity and the resulting optimization strategy, rather than node or GPU count alone, are the primary drivers of throughput heterogeneity across supercomputers.
Figure 2: FedAvg test loss over eight global aggregation rounds using 64 nodes per supercomputer (client). The global model (purple) consistently achieves the lowest test loss across all 14 tasks, decreasing from 1.39 to 0.37, while local client models show higher losses reflecting specialization to their assigned task groups.
Figure 3: Test loss progression of four FL algorithms under realistic queueing conditions using two nodes per facility. Each panel shows local client models alongside the global aggregated model within the same wall-clock time budget of 17,000 seconds. FedCompass achieves the lowest final global test loss, while FedAvg produces the most stable client trajectories.
Figure 4: Globus Transfer communication efficiency across clients for five models ranging from OPT-125m to Llama2-13b. Panel a shows the linear relationship between model parameters and storage size in BF16 format. Panel b shows transfer speed versus model size, with Aurora and Polaris achieving the highest speeds due to co-location with the server at Argonne. Panel c summarizes average transfer speeds over all five models per destination facility.
Figure 5: Polaris queue time with varying number of nodes and walltime requested. Solid lines represent the mean queue wait time and shaded regions represent one standard deviation. Different queues are designed to handle jobs with different characteristics in order to maximize supercomputer utilization. Each queue imposes its own constraints on the maximum number of nodes requestable, the maximum walltime allowed, and the allocation of available node-hours.
...and 1 more figure
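
The linear relationship between parameter count and BF16 storage size noted in the Figure 4 caption follows from BF16 using 2 bytes per parameter. A minimal sketch of that arithmetic is given below. The parameter counts are nominal values read off the model names (an assumption, not figures reported by the paper), and the helper name bf16_size_gb is hypothetical.

```python
# Sketch only: estimates the size of model weights stored in BF16.
# BF16 (bfloat16) is a 16-bit format, so each parameter takes 2 bytes.
BYTES_PER_BF16_PARAM = 2


def bf16_size_gb(num_params: int) -> float:
    """Approximate weight size in GB (10^9 bytes) for BF16 storage."""
    return num_params * BYTES_PER_BF16_PARAM / 1e9


# Assumption: nominal parameter counts inferred from the model names;
# the exact counts of the real checkpoints differ slightly.
nominal_params = {
    "OPT-125m": 125_000_000,
    "Llama2-13b": 13_000_000_000,
}

for name, count in nominal_params.items():
    print(f"{name}: ~{bf16_size_gb(count):.2f} GB in BF16")
```

Under these nominal counts the smallest and largest models come to roughly 0.25 GB and 26 GB. Checkpoint files on disk may be somewhat larger because of metadata and any tensors not stored in BF16.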
