FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models
Rui Ye, Rui Ge, Xinyu Zhu, Jingyi Chai, Yaxin Du, Yang Liu, Yanfeng Wang, Siheng Chen
TL;DR
The paper addresses the lack of realistic benchmarks for federated learning of large language models (FedLLM) by introducing FedLLM-Bench, a testbed comprising four datasets, eight training methods, and six evaluation metrics. The four naturally split datasets (Fed-Aya, Fed-ChatbotIT, Fed-WildChat, Fed-ChatbotPA) range from 38 to 747 clients and reflect real-world heterogeneity in language, data quality, length, and user preferences. The benchmark integrates eight representative FL baselines, all using parameter-efficient fine-tuning via LoRA. The benchmarking results show consistent gains from federation over local training, highlight the importance of language personalization and cross-language collaboration, and expose privacy–utility trade-offs under differential privacy. The work is of practical value because it enables fair comparisons, reduces implementation effort, and points to future directions in FedLLM research, including broader model coverage and safety alignment.
Abstract
Federated learning has enabled multiple parties to collaboratively train large language models without directly sharing their data (FedLLM). Following this training paradigm, the community has invested substantial effort in diverse aspects, including frameworks, performance, and privacy. However, there are currently no realistic datasets or benchmarks for FedLLM; previous works all rely on artificially constructed datasets, which fail to capture the properties of real-world scenarios. To address this, we propose FedLLM-Bench, which involves 8 training methods, 4 training datasets, and 6 evaluation metrics, to offer a comprehensive testbed for the FedLLM community. FedLLM-Bench encompasses three datasets (e.g., a user-annotated multilingual dataset) for federated instruction tuning and one dataset (e.g., a user-annotated preference dataset) for federated preference alignment, with client counts ranging from 38 to 747. Our datasets incorporate several representative forms of diversity: language, quality, quantity, instruction, length, embedding, and preference, capturing the properties of real-world scenarios. Based on FedLLM-Bench, we conduct experiments on all datasets to benchmark existing FL methods and provide empirical insights (e.g., on multilingual collaboration). We believe that FedLLM-Bench can benefit the FedLLM community by reducing the effort required, providing a practical testbed, and promoting fair comparisons. Code and datasets are available at https://github.com/rui-ye/FedLLM-Bench.
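As an illustration of the collaborative training described above, the following is a minimal sketch of one server-side aggregation step in the style of FedAvg, a standard federated learning method. It is not taken from the FedLLM-Bench code: the text above does not list the eight training methods, so the choice of FedAvg, the function name `fedavg`, the flat-list parameter representation, and the parameter name `lora_A` are all assumptions made for illustration. The idea is that each client trains a small set of adapter weights locally, and the server combines them as an average weighted by each client's number of training samples, so raw data never leaves the client.

```python
from typing import Dict, List

# Hypothetical representation: each client's trainable (e.g. LoRA adapter)
# weights as a mapping from parameter name to a flat list of floats.
Params = Dict[str, List[float]]


def fedavg(client_params: List[Params], client_sizes: List[int]) -> Params:
    """Average client parameters, weighting each client by its sample count."""
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client, and at least one client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    averaged: Params = {}
    for name in client_params[0]:
        length = len(client_params[0][name])
        # Weighted mean of every coordinate of this parameter across clients.
        averaged[name] = [
            sum(p[name][i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Toy example: two clients holding 1 and 3 samples respectively.
clients = [{"lora_A": [1.0, 2.0]}, {"lora_A": [3.0, 6.0]}]
sizes = [1, 3]
global_params = fedavg(clients, sizes)
print(global_params)  # {'lora_A': [2.5, 5.0]}
```

Weighting by sample count matters when client sizes vary widely: an unweighted mean would let a client with a handful of samples influence the global model as much as one with thousands.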
