FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models

Yan Gao, Massimo Roberto Scamarcia, Javier Fernandez-Marques, Mohammad Naseri, Chong Shen Ng, Dimitris Stripelis, Zexi Li, Tao Shen, Jiamu Bai, Daoyuan Chen, Zikai Zhang, Rui Hu, InSeo Song, Lee KangYoon, Hong Jia, Ting Dang, Junyan Wang, Zheyuan Liu, Daniel Janes Beutel, Lingjuan Lyu, Nicholas D. Lane

TL;DR

FlowerTune introduces a cross-domain, federated fine-tuning benchmark for LLMs, evaluating 26 base models across general NLP, finance, medical, and coding using adapter-based PEFT (LoRA/DoRA) within a unified FL framework. The results indicate instruct-tuned models generally outperform non-instruct ones and that base-model choice drives most performance differences, with resource usage remaining manageable under the proposed setup. The work provides domain-specific evaluation pipelines and baseline results to advance privacy-preserving, domain-adapted LLMs, while revealing that aggregation strategy differences are modest and highlighting the need for specialized FL methods tailored to Low-Rank adapters. Overall, FlowerTune establishes a community-driven, reproducible platform to accelerate federated LLM development in privacy-sensitive domains.

Abstract

Large Language Models (LLMs) have achieved state-of-the-art results across diverse domains, yet their development remains reliant on vast amounts of publicly available data, raising concerns about data scarcity and the lack of access to domain-specific, sensitive information. Federated Learning (FL) presents a compelling framework to address these challenges by enabling decentralized fine-tuning of pre-trained LLMs without sharing raw data. However, the compatibility and performance of pre-trained LLMs in FL settings remain largely underexplored. We introduce the FlowerTune LLM Leaderboard, a first-of-its-kind benchmarking suite designed to evaluate federated fine-tuning of LLMs across four diverse domains: general NLP, finance, medical, and coding. Each domain includes federated instruction-tuning datasets and domain-specific evaluation metrics. Our results, obtained through a collaborative, open-source, and community-driven approach, provide the first comprehensive comparison across 26 pre-trained LLMs with different aggregation and fine-tuning strategies under federated settings, offering actionable insights into model performance, resource constraints, and domain adaptation. This work lays the foundation for developing privacy-preserving, domain-specialized LLMs for real-world applications.

Paper Structure

This paper contains 29 sections, 13 figures, 21 tables.

Figures (13)

  • Figure 1: Overview of the FlowerTune LLM Leaderboard. This leaderboard provides four challenges covering: general NLP, finance, medical, and coding. After selecting a challenge, participants can initiate federated fine-tuning using a provided template tailored to the specific scenario. The template is model-agnostic and supports various pre-trained base models, fine-tuning strategies, and aggregation algorithms, enabling flexible adaptation. Upon completion of training, the resulting global LLM is evaluated using domain-specific metrics, with scores reported to reflect the performance and quality of the tuned model.
  • Figure 2: Illustration of the federated LLM fine-tuning process. (1) Initialization of LoRA/DoRA adapters and client selection on the server; (2) transmission of adapter parameters to the selected clients; (3) local adapter fine-tuning with the base model frozen; (4) transmission of updated adapter parameters back to the server; (5) aggregation of adapter parameters. This process is repeated in each subsequent FL round.
  • Figure 3: Average training loss over FL rounds with 6 selected models on four challenges. The training loss exhibits a consistent downward trend across all tasks, with larger fluctuations observed in the coding challenge.
  • Figure 4: Accuracy (%) versus system performance for different non-instruct base models after federated fine-tuning on the General NLP challenge, as presented in Table \ref{tab:nlp_base}. The error bars indicate $\pm 1$ standard deviation across the different downstream tasks.
  • Figure 5: Accuracy (%) versus system performance for different non-instruct base models after federated fine-tuning on the Finance challenge, as presented in Table \ref{tab:finance_base}. The error bars indicate $\pm 1$ standard deviation across the different downstream tasks.
  • ...and 8 more figures
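
The five-step round described in the Figure 2 caption can be illustrated with a toy sketch. This is not the FlowerTune template or the Flower API: the function names (`local_finetune`, `aggregate`, `run_round`), the client dictionaries, and the example-count-weighted average (FedAvg-style) are all assumptions made for illustration, and the "adapter" is just a short list of floats standing in for LoRA/DoRA parameters. Local fine-tuning is replaced by a step toward a client-specific target so the example stays runnable without any model.

```python
import random

def local_finetune(adapter, client_target, lr=0.5):
    """Toy stand-in for step (3): nudge adapter weights toward a client target.
    Only adapter values change; a real base model would stay frozen."""
    return [w + lr * (t - w) for w, t in zip(adapter, client_target)]

def aggregate(updates):
    """Step (5), assumed here to be an example-count-weighted average."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(params[i] * n for params, n in updates) / total
            for i in range(dim)]

def run_round(global_adapter, clients, fraction, rng):
    # (1) Server-side client selection for this round.
    k = max(1, int(len(clients) * fraction))
    selected = rng.sample(clients, k)
    updates = []
    for client in selected:
        # (2) Server sends a copy of the adapter parameters to the client.
        received = list(global_adapter)
        # (3) Client fine-tunes the adapter locally.
        updated = local_finetune(received, client["target"])
        # (4) Client returns the updated adapter and its example count.
        updates.append((updated, client["num_examples"]))
    # (5) Server aggregates the returned adapter parameters.
    return aggregate(updates)

# Hypothetical clients with made-up targets and dataset sizes.
rng = random.Random(0)
clients = [
    {"target": [1.0, 2.0], "num_examples": 10},
    {"target": [3.0, 4.0], "num_examples": 30},
]
adapter = [0.0, 0.0]  # (1) Server-side adapter initialization.
for _ in range(20):   # The same steps repeat in every FL round.
    adapter = run_round(adapter, clients, fraction=1.0, rng=rng)
```

Because only the adapter values travel between server and clients in this sketch, the per-round payload is the size of the adapter rather than of the full model, which is the property the adapter-based setup in the caption relies on.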