FewFedPIT: Towards Privacy-preserving and Few-shot Federated Instruction Tuning
Zhuo Zhang, Jingyuan Zhang, Jintao Huang, Lizhen Qu, Hongzhi Zhang, Qifan Wang, Xun Zhou, Zenglin Xu
TL;DR
FewFedPIT tackles privacy and data scarcity in federated instruction tuning by combining on-device synthetic data generation with parameter isolation training and privacy-preserving local aggregation. The method leverages a federated LLM as a data generator, filters synthetic examples with an LLM-based judge using an instruction-following score, and blends public and private updates via a tunable parameter $eta$ to mitigate data leakage. Empirical results across three instruction datasets show FewFedPIT outperforming standard FedIT baselines and approaching centralized performance, while offering flexible privacy-utility tradeoffs. The approach demonstrates robust performance under non-IID conditions and provides actionable strategies to defend against training data extraction attacks in federated LLM settings.
Abstract
Instruction tuning has been identified as a crucial technique for optimizing the performance of large language models (LLMs) in generating human-aligned responses. Nonetheless, gathering diverse, high-quality instruction data for such tuning presents notable obstacles, especially in domains with strict privacy provisions. Federated instruction tuning (FedIT) has emerged as a promising solution: it coordinates collaborative training across multiple data owners, thereby yielding a privacy-preserving learning model. However, FedIT encounters limitations such as scarcity of instruction data and the risk of exposure to training data extraction attacks. In this paper, we propose a novel federated algorithm, FewFedPIT, designed to simultaneously enhance privacy protection and model performance in federated few-shot learning. FewFedPIT comprises three vital components on the client side: (1) synthetic data generation, which utilizes LLMs' in-context learning capacity to generate synthetic data autonomously, thus expanding the local database; (2) parameter isolation training, which separately updates the public parameters on the synthetic data and the private parameters on the local data, consequently mitigating the noise impact of the synthetic data; (3) local aggregation sharing, which mixes public and private parameters before uploading, effectively preventing data extraction attacks. Extensive experiments on three open-source datasets demonstrate the effectiveness of FewFedPIT in enhancing privacy preservation and improving federated few-shot performance.
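
The local aggregation sharing step in component (3) can be illustrated with a minimal sketch. The abstract does not give the mixing formula, so the code below assumes a simple convex combination of the two parameter sets. The function name `mix_parameters`, the argument `weight`, and the choice that `weight` multiplies the public parameters are all hypothetical and used only for illustration.

```python
from typing import Dict, List

# Hypothetical representation: each named parameter is a flat list of floats.
Params = Dict[str, List[float]]


def mix_parameters(public: Params, private: Params, weight: float) -> Params:
    """Blend public and private parameters before upload.

    Assumption (not taken from the paper): the blend is the convex
    combination weight * public + (1 - weight) * private, applied
    element-wise to every named parameter.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if public.keys() != private.keys():
        raise ValueError("public and private parameters must share names")

    mixed: Params = {}
    for name in public:
        pub, priv = public[name], private[name]
        if len(pub) != len(priv):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise interpolation between the two parameter sets.
        mixed[name] = [weight * a + (1.0 - weight) * b for a, b in zip(pub, priv)]
    return mixed


# Toy usage: the client uploads `to_upload`, not the raw private parameters.
public_params = {"w": [1.0, 2.0]}
private_params = {"w": [3.0, 4.0]}
to_upload = mix_parameters(public_params, private_params, weight=0.5)
```

Under this assumed form, a weight of 1 uploads only the parameters trained on synthetic data, while a weight of 0 uploads only those trained on local data. Intermediate values trade privacy against utility.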
