FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models

Rui Ye, Rui Ge, Xinyu Zhu, Jingyi Chai, Yaxin Du, Yang Liu, Yanfeng Wang, Siheng Chen

TL;DR

The paper addresses the lack of realistic benchmarks for federated learning of large language models (FedLLM) by introducing FedLLM-Bench, a testbed comprising four datasets, eight training methods, and six evaluation metrics. The four naturally split datasets (Fed-Aya, Fed-ChatbotIT, Fed-WildChat, Fed-ChatbotPA) contain 38–747 clients and reflect real-world heterogeneity in language, data quality, length, and user preferences. The benchmark integrates eight representative FL baselines, all using parameter-efficient fine-tuning via LoRA. The benchmarking results show consistent gains from federation over local training, highlight the importance of language personalization and cross-language collaboration, and expose privacy–utility trade-offs under differential privacy. The work enables fair comparisons, reduces implementation effort, and points to future directions in FedLLM research, including broader model coverage and safety alignment.

Abstract

Federated learning has enabled multiple parties to collaboratively train large language models without directly sharing their data (FedLLM). Following this training paradigm, the community has invested substantial effort in diverse aspects, including framework, performance, and privacy. However, there are currently no realistic datasets or benchmarks for FedLLM, and previous works all rely on artificially constructed datasets that fail to capture properties of real-world scenarios. To address this, we propose FedLLM-Bench, which involves 8 training methods, 4 training datasets, and 6 evaluation metrics, to offer a comprehensive testbed for the FedLLM community. FedLLM-Bench encompasses three datasets (e.g., a user-annotated multilingual dataset) for federated instruction tuning and one dataset (e.g., a user-annotated preference dataset) for federated preference alignment, with the number of clients ranging from 38 to 747. Our datasets incorporate several representative diversities: language, quality, quantity, instruction, length, embedding, and preference, capturing properties of real-world scenarios. Based on FedLLM-Bench, we conduct experiments on all datasets to benchmark existing FL methods and provide empirical insights (e.g., multilingual collaboration). We believe that FedLLM-Bench can benefit the FedLLM community by reducing required effort, providing a practical testbed, and promoting fair comparisons. Code and datasets are available at https://github.com/rui-ye/FedLLM-Bench.

Paper Structure

This paper contains 25 sections, 5 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: (a) Language distribution of clients in the Fed-Aya dataset. (b) Distribution of instruction and response lengths in clients' data. (c) Distribution of length preference (the fraction of cases in which a user prefers the longer response) of clients in the Fed-ChatbotPA dataset. (d) Distribution of quality preference (the quality difference between preferred and dispreferred data) of clients in the Fed-ChatbotPA dataset.
  • Figure 2: Distributions of top 10 verbs in instructions (10 clients are plotted for illustration). Our realistic FedLLM datasets exhibit diverse patterns with respect to instruction types.
  • Figure 3: The dataset quality distribution of clients in four training datasets: Fed-Aya, Fed-WildChat, Fed-ChatbotIT and Fed-ChatbotPA. We average the IFD scores of all instruction-response pairs of each client to represent the client's dataset quality.
  • Figure 4: The t-SNE visualization of embeddings of instruction-response pairs from 10 clients in the Fed-Aya, Fed-ChatbotIT, Fed-WildChat, and Fed-ChatbotPA datasets. Each color denotes one client. Each client's data tends to cluster together, and the data differ across clients.
  • Figure 5: Data quantity distribution across clients of our four FedLLM datasets. Data quantities vary widely across clients, and a large proportion of clients have relatively little data.
  • ...and 7 more figures
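
The per-client statistics named in the captions of Figure 1(c) and Figure 3 can be illustrated with a small sketch. This is not the authors' code: the function names, the input layout, and the choice to measure response length in characters are assumptions made for illustration, and the IFD scores are treated as precomputed numbers.

```python
from collections import defaultdict
from statistics import mean


def client_quality(records):
    """Average per-sample scores within each client (as in the Figure 3 caption).

    `records` is an iterable of (client_id, score) pairs. The scores stand in
    for precomputed IFD values; computing IFD itself is out of scope here.
    """
    by_client = defaultdict(list)
    for client_id, score in records:
        by_client[client_id].append(score)
    return {cid: mean(scores) for cid, scores in by_client.items()}


def length_preference(pairs):
    """Fraction of (preferred, dispreferred) pairs where the preferred text is longer.

    Length is measured in characters, which is an assumption of this sketch.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    longer = sum(len(preferred) > len(dispreferred) for preferred, dispreferred in pairs)
    return longer / len(pairs)


# Hypothetical toy data, not taken from any FedLLM-Bench dataset.
quality = client_quality([("a", 0.5), ("a", 1.0), ("b", 0.25)])
ratio = length_preference([("long answer", "short"), ("hi", "hello there")])
print(quality)  # {'a': 0.75, 'b': 0.25}
print(ratio)    # 0.5
```

Plotting a histogram of these per-client values would give distributions of the kind shown in the figures.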