Data Quality Control in Federated Instruction-tuning of Large Language Models

Yaxin Du, Rui Ye, Fengting Yuchi, Wanru Zhao, Jingjing Qu, Yanfeng Wang, Siheng Chen

TL;DR

This work tackles data quality challenges in privacy-preserving federated instruction tuning of large language models. It introduces FedDQC, coupling a client-side data-quality metric called Instruction-Response Alignment (IRA) with a quality-aware hierarchical training strategy that progresses from high-IRA to low-IRA data. The approach enables dynamic data selection and staged learning, improving robustness and performance on both synthetic and real-world mixed-quality data under IID and non-IID settings, with only about 1% additional scoring overhead. By preserving data privacy and avoiding extra communication, FedDQC offers a practical, scalable solution for federated LLM instruction tuning in privacy-sensitive domains.

Abstract

Federated Learning (FL) enables privacy-preserving collaborative instruction tuning of large language models (LLMs) by leveraging massively distributed data. However, the decentralized nature of FL exacerbates data quality challenges, as local clients lack global visibility to filter noisy or low-quality samples before training. To resolve this issue, we propose FedDQC, a novel federated instruction tuning framework with dynamic data quality control. Our approach introduces two key innovations. First, we propose instruction-response alignment (IRA), an efficient client-side metric for quality evaluation requiring only low-cost inference. We validate that higher-IRA data corresponds to more relevant and easier-to-learn question-answer pairs. Second, mirroring the human easy-to-hard knowledge acquisition process, we design a quality-aware hierarchical FL training framework, where the LLM is progressively fine-tuned from high- to low-IRA data in a collaborative manner. The framework also supports adaptive data quality assessment at each hierarchy, enabling dynamic adjustments throughout the training process. Extensive experiments on synthetic and real-world datasets show that our method significantly improves LLM performance on mixed-quality data in FL.
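The high-to-low-IRA ordering described in the abstract can be illustrated with a small client-side sketch. It is not the authors' implementation: the abstract does not give the IRA formula, so the scores are taken as precomputed floats, and the function name `split_into_hierarchies`, the near-equal group sizes, and the `num_hierarchies` parameter are all assumptions made for illustration. The sketch also omits the re-scoring that FedDQC performs at each hierarchy.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_hierarchies(
    samples: Sequence[T],
    scores: Sequence[float],
    num_hierarchies: int,
) -> List[List[T]]:
    """Rank samples by score (highest first) and cut them into ordered groups.

    Hypothetical helper: `scores` stands in for per-sample IRA values, whose
    exact definition is not given in the abstract.
    """
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    if num_hierarchies < 1:
        raise ValueError("num_hierarchies must be at least 1")

    # Indices sorted so that the highest-scoring sample comes first.
    order = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    ranked = [samples[i] for i in order]

    # Assumption: near-equal group sizes; earlier groups absorb the remainder.
    base, extra = divmod(len(ranked), num_hierarchies)
    groups: List[List[T]] = []
    start = 0
    for k in range(num_hierarchies):
        size = base + (1 if k < extra else 0)
        groups.append(ranked[start:start + size])
        start += size
    return groups


if __name__ == "__main__":
    data = ["a", "b", "c", "d", "e"]
    ira = [0.1, 0.9, 0.5, 0.7, 0.3]
    # Training would visit the first group (highest scores) first.
    for level, group in enumerate(split_into_hierarchies(data, ira, 2)):
        print(level, group)
```

A client would train on the groups in list order, so the highest-scoring data is seen first. In the framework described above, the remaining data would be re-scored with the updated global model before each later hierarchy, a step this static split leaves out.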

Paper Structure

This paper contains 60 sections, 2 equations, 12 figures, 14 tables.

Figures (12)

  • Figure 1: Top: an example of low-quality and high-quality data. Left: quality heterogeneity across federated clients. Right: how data quality affects federated training performance, and how FedDQC eliminates the effect of low-quality data.
  • Figure 2: Overview of FedDQC, which alternates between two stages: (1) Scoring stage: IRA and the global model are used to evaluate data quality; (2) Hierarchical training stage: the model is progressively fine-tuned from high-IRA to low-IRA data, mirroring the easy-to-hard learning process. The two stages repeat until the last hierarchy is reached.
  • Figure 3: Data Map visualization. (a) Data Map with ground-truth quality labels. (b) Data Map with IRA scores on a pre-trained model. (c) Data Map with IRA scores on a fine-tuned model.
  • Figure 4: Comparison of additional computation cost and performance gain when applying different quality evaluation metrics in the PubMedQA IID setting. IRA adds minimal computational overhead while significantly improving performance through data quality control.
  • Figure 5: Model similarity comparison between FedAvg and FedDQC on the PubMedQA dataset in non-IID (NIID) settings.
  • ...and 7 more figures