Empowering Federated Learning for Massive Models with NVIDIA FLARE

Holger R. Roth; Ziyue Xu; Yuan-Ting Hsieh; Adithya Renduchintala; Isaac Yang; Zhihong Zhang; Yuhong Wen; Sean Yang; Kevin Lu; Kristopher Kersten; Camir Ricketts; Daguang Xu; Chester Chen; Yan Cheng; Andrew Feng

TL;DR

This work addresses the data-access bottleneck in training massive LLMs by applying federated learning (FL) through NVIDIA FLARE (NVFlare). It shows how NVFlare's Client API and data-streaming capabilities enable scalable, privacy-preserving parameter-efficient fine-tuning (PEFT) and full supervised fine-tuning (SFT) in NLP and biopharma contexts, including federated protein embeddings. The paper presents a practical FL architecture, detailing the server-side Controller workflow and the streaming mechanism used to transfer large model updates. Experiments cover large-model streaming, PEFT, SFT, and a subcellular-location prediction task using BioNeMo/ESM-1nv. The results indicate that FL with NVFlare can improve robustness and accuracy without centralized data sharing, offering a viable path to production-grade federated training of massive models.

Abstract

In the ever-evolving landscape of artificial intelligence (AI) and large language models (LLMs), handling and leveraging data effectively has become a critical challenge. Most state-of-the-art machine learning algorithms are data-centric. However, as the lifeblood of model performance, necessary data cannot always be centralized due to various factors such as privacy, regulation, geopolitics, copyright issues, and the sheer effort required to move vast datasets. In this paper, we explore how federated learning enabled by NVIDIA FLARE can address these challenges with easy and scalable integration capabilities, enabling parameter-efficient and full supervised fine-tuning of LLMs for natural language processing and biopharmaceutical applications to enhance their accuracy and robustness.

Paper Structure (20 sections, 9 figures, 1 table)

Introduction
The Data Challenge
Federated Learning
Methods
FL Framework
Easy Adaptation of ML Workflows via Client API
Server Workflow Implementation
Scalable Model Training via Streaming
Applications
Adaptation of Foundational LLMs
FL for LLM Adaptations
Federated Protein Embeddings and Task Model Fitting
Subcellular Location Prediction
Model Architecture
Results
...and 5 more sections

Figures (9)

Figure 1: Server workflow Controller and Executor with Client API.
Figure 2: Data streaming API.
Figure 3: Federated parameter-efficient fine-tuning (PEFT) and full supervised fine-tuning (SFT) with global model and $n$ clients.
Figure 4: Cross section of an animal cell.
Figure 5: Memory usage during streaming of a 128GB large model.
...and 4 more figures
