Empowering Federated Learning for Massive Models with NVIDIA FLARE
Holger R. Roth, Ziyue Xu, Yuan-Ting Hsieh, Adithya Renduchintala, Isaac Yang, Zhihong Zhang, Yuhong Wen, Sean Yang, Kevin Lu, Kristopher Kersten, Camir Ricketts, Daguang Xu, Chester Chen, Yan Cheng, Andrew Feng
TL;DR
This work addresses the data-access bottleneck in training massive LLMs by applying federated learning (FL) with NVIDIA FLARE (NVFlare). It shows how NVFlare's Client API and data-streaming capabilities enable scalable, privacy-preserving parameter-efficient fine-tuning (PEFT) and full supervised fine-tuning (SFT) in NLP and biopharma settings, including federated protein embeddings. The paper presents a practical FL architecture, detailing server-controller workflows and streaming for large model updates, with experiments covering large-model streaming, PEFT, SFT, and a subcellular-location task using BioNeMo/ESM-1nv. The results indicate that FL with NVFlare can improve robustness and accuracy without centralized data sharing, offering a viable path toward production-grade federated training of massive models.
Abstract
In the ever-evolving landscape of artificial intelligence (AI) and large language models (LLMs), handling and leveraging data effectively has become a critical challenge. Most state-of-the-art machine learning algorithms are data-centric. Data is the lifeblood of model performance, yet the necessary data cannot always be centralized because of privacy, regulation, geopolitics, copyright issues, and the sheer effort required to move vast datasets. In this paper, we explore how federated learning enabled by NVIDIA FLARE can address these challenges with easy and scalable integration capabilities. It enables parameter-efficient and full supervised fine-tuning of LLMs for natural language processing and biopharmaceutical applications, enhancing their accuracy and robustness.
