Enhancing Data Quality in Federated Fine-Tuning of Foundation Models

Wanru Zhao, Yaxin Du, Nicholas Donald Lane, Siheng Chen, Yanfeng Wang

TL;DR

A data quality control pipeline for federated fine-tuning of foundation models is proposed, which computes scores reflecting the quality of training data and determines a global threshold for a unified standard, aiming for improved global performance.

Abstract

In the current landscape of foundation model training, there is a significant reliance on public domain data, which is nearing exhaustion according to recent research. To further scale up, it is crucial to incorporate collaboration among multiple specialized and high-quality private domain data sources. However, the challenge of training models locally without sharing private data presents numerous obstacles in data quality control. To tackle this issue, we propose a data quality control pipeline for federated fine-tuning of foundation models. This pipeline computes scores reflecting the quality of training data and determines a global threshold for a unified standard, aiming for improved global performance. Our experiments show that the proposed quality control pipeline facilitates the effectiveness and reliability of the model training, leading to better performance.

Paper Structure (39 sections, 4 figures, 5 tables)

Figures (4)

  • Figure 1: The impact of low-quality data on the performance of federated fine-tuning of LLMs.
  • Figure 2: The overall workflow consists of two phases. Phase I: each client computes a quality score for each of its samples with scoring functions, using the public validation set and the global model; the server then aggregates the scores and derives a global threshold from anchor data. Phase II: clients filter their data according to the global threshold and start federated learning on the selected high-quality data.
  • Figure 3: Composition of high-quality and low-quality data in the NIID-1 and NIID-2 settings.
  • Figure 4: Number of selected data samples and the proportion of low-quality data under different selection principles (select by proportion, by score threshold, or by anchor set score) in the federated NIID-1 setting, using the ConPro score.
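
The two-phase workflow in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function names, the toy word-count scoring function, the choice of the mean anchor score as the global threshold, and the convention that a higher score means higher quality are all assumptions made for the example. The actual scoring functions (such as ConPro) rely on the global model and the public validation set, which are not reproduced here.

```python
from statistics import mean


def score_samples(samples, score_fn):
    """Phase I (client side): compute a quality score for every local sample."""
    return [score_fn(sample) for sample in samples]


def global_threshold(anchor_scores):
    """Phase I (server side): derive a single threshold from anchor-data scores.

    Assumption: the threshold is the mean anchor score. The paper may
    aggregate the scores differently.
    """
    return mean(anchor_scores)


def filter_by_threshold(samples, scores, threshold):
    """Phase II (client side): keep samples that score at or above the threshold.

    Assumption: a higher score indicates higher quality.
    """
    return [s for s, sc in zip(samples, scores) if sc >= threshold]


def toy_score(sample):
    """Hypothetical stand-in for a real scoring function: the word count."""
    return len(sample.split())


# Toy private datasets held by two clients (never sent to the server).
clients = {
    "client_a": ["a detailed and helpful answer", "ok"],
    "client_b": ["short", "another long and careful reply here"],
}
# Toy anchor data used to set the unified standard.
anchor = ["a clean reference answer", "one more good sample"]

threshold = global_threshold(score_samples(anchor, toy_score))
selected = {
    name: filter_by_threshold(data, score_samples(data, toy_score), threshold)
    for name, data in clients.items()
}
print(threshold, selected)
```

Only the threshold travels from the server to the clients in this sketch. Each client filters its own data locally, so the raw samples never leave the client, and federated training would then run on the `selected` subsets.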