Automated Data Curation for Robust Language Model Fine-Tuning

Jiuhai Chen; Jonas Mueller

TL;DR

This work introduces CLEAR (Confidence-based LLM Evaluation And Rectification), an automated data curation pipeline for instruction-tuning datasets that can be used with any LLM and fine-tuning procedure. Experiments show that CLEAR consistently improves the performance of fine-tuned models across many datasets and models.

Abstract

Large Language Models have become the de facto approach to sequence-to-sequence text generation tasks, but for specialized tasks and domains, a pretrained LLM lacks the specific capabilities needed to produce accurate or well-formatted responses. Supervised fine-tuning specializes an LLM by training it on a dataset of example prompts with target responses, but real-world data tends to be noisy. While many fine-tuning algorithms exist, here we take a data-centric AI perspective on LLM fine-tuning, studying how to systematically curate the training dataset to improve the LLM produced via any fine-tuning algorithm. We introduce CLEAR (Confidence-based LLM Evaluation And Rectification), an automated data curation pipeline for instruction-tuning datasets that can be used with any LLM and fine-tuning procedure. CLEAR estimates which training data is low-quality and either filters or corrects it. It decides which data to filter or correct using LLM-derived confidence estimates, so that only confident modifications are made to the dataset. Unlike existing data curation techniques, CLEAR is a comprehensive framework that can improve a dataset (and trained model outputs) without additional fine-tuning computations. We do not assume access to an LLM stronger than the model being fine-tuned (e.g., relying on GPT-4 when fine-tuning GPT-3.5), in order to test whether CLEAR can meaningfully improve the capabilities of any LLM. Experiments reveal that CLEAR consistently improves the performance of fine-tuned models across many datasets and models (such as GPT-3.5 and Llama2).
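The filtering idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `auto_filter`, the `threshold` value of 0.5, and the `toy_confidence` stand-in are all assumptions made for illustration. In practice the confidence callable would wrap an LLM-derived estimate of response quality.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (prompt, target response)


def auto_filter(
    data: List[Example],
    confidence: Callable[[str, str], float],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> Tuple[List[Example], List[Example]]:
    """Split a dataset into kept examples and examples flagged as low-quality.

    `confidence` is any function mapping (prompt, response) to a value in
    [0, 1]; higher values mean the response is more likely to be good.
    """
    kept: List[Example] = []
    flagged: List[Example] = []
    for prompt, response in data:
        if confidence(prompt, response) >= threshold:
            kept.append((prompt, response))
        else:
            flagged.append((prompt, response))
    return kept, flagged


def toy_confidence(prompt: str, response: str) -> float:
    # Stand-in for an LLM-derived estimate: empty responses get low confidence.
    return 0.9 if response.strip() else 0.1


data = [("What is 2 + 2?", "4"), ("What is the capital of France?", "")]
kept, flagged = auto_filter(data, toy_confidence)
print(len(kept), len(flagged))  # prints: 1 1
```

Passing the confidence estimator as a plain callable keeps the sketch independent of any particular LLM or fine-tuning procedure, which mirrors the model-agnostic framing of the abstract.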

Paper Structure (18 sections, 1 equation, 5 figures, 5 tables)

Introduction
Related Work
Data Curation for ML
Instruction Fine-tuning
Data Curation for Instruction Fine-tuning
Automated Data Curation with CLEAR
Auto-Filter
Auto-Correct
Experiments
Datasets.
Evaluation metrics.
Baseline Methods.
Other Details.
Results
Estimating Response Quality in Auto-Filter
...and 3 more sections

Figures (5)

Figure 1: An overview of the CLEAR data curation procedure to automatically filter and correct bad data in any instruction-tuning dataset composed of instructions/prompts $X_i$ and corresponding target responses $Y_i$.
Figure 2: Comparing confidence-based vs. score-based answer quality evaluators. The confidence-based (BSDetector) evaluator outputs a confidence value between 0 and 1. The direct LLM-scoring evaluator queries GPT-3.5-Turbo using a prompt (shown in Table \ref{tab:score_prompt}) that requests a score from 1 to 5 to rate response quality. Higher values from either evaluator suggest higher-quality answers. For the incorrect response from the original dataset (top panel), the confidence-based evaluator estimates low quality, while the score-based evaluator assigns a score of 4.0. For the correct answer to this prompt (bottom panel), the confidence-based evaluator estimates high quality, while the score-based evaluator still assigns a score of 4.0. Direct LLM score-based evaluation thus distinguishes less reliably between right and wrong responses.
Figure 3: Three examples from the DROP-N dataset. The first example (left) is retained in the dataset because the original response has high BSDetector-estimated confidence (0.91). The second example (middle) has an original response with low estimated confidence (0.41), and we estimate with confidence 0.82 that the candidate alternative response generated by our fine-tuned LLM is better than the original response. Since this exceeds our confidence threshold $\eta=0.8$, we replace the target response for this second example with the LLM-generated candidate response in our curated dataset. The third example (right) has an original response with low estimated confidence (0.03), but we also estimate only low confidence (0.21) that the candidate response from our fine-tuned LLM is better. This third example is thus removed entirely from our curated dataset.
Figure 4: Three examples from the SQuAD-N dataset. The first example (left) is retained in the dataset because the original response has high BSDetector-estimated confidence (0.92). The second example (middle) has an original response with low estimated confidence (0.29), and we estimate with confidence 0.91 that the candidate alternative response generated by our fine-tuned LLM is better than the original response. Since this exceeds our confidence threshold $\eta=0.8$, we replace the target response for this second example with the LLM-generated candidate response in our curated dataset. The third example (right) has an original response with low estimated confidence (0.31), but we also estimate only low confidence (0.42) that the candidate response from our fine-tuned LLM is better. This third example is thus removed entirely from our curated dataset.
Figure 5: Three examples from the Email-N dataset. The first example (left) is retained in the dataset because the original response has high BSDetector-estimated confidence (0.89). The second example (middle) has an original response with low estimated confidence (0.42), and we estimate with confidence 0.84 that the candidate alternative response generated by our fine-tuned LLM is better than the original response. Since this exceeds our confidence threshold $\eta=0.8$, we replace the target response for this second example with the LLM-generated candidate response in our curated dataset. The third example (right) has an original response with low estimated confidence (0.23), but we also estimate only low confidence (0.51) that the candidate response from our fine-tuned LLM is better. This third example is thus removed entirely from our curated dataset.
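
The three outcomes illustrated in Figures 3 to 5 (keep, replace, remove) can be summarized as a small decision rule. The sketch below is only an illustration under stated assumptions: the replacement threshold $\eta=0.8$ comes from the captions, but the function name `curate_example` and the `keep_threshold` of 0.5 for treating an original response as high confidence are hypothetical and not taken from the paper.

```python
from typing import Optional


def curate_example(
    original_conf: float,
    candidate_better_conf: Optional[float],
    eta: float = 0.8,             # replacement threshold stated in the captions
    keep_threshold: float = 0.5,  # hypothetical cutoff for "high confidence"
) -> str:
    """Return 'keep', 'replace', or 'remove' for one training example.

    original_conf: estimated confidence that the original response is good.
    candidate_better_conf: estimated confidence that the fine-tuned LLM's
        candidate response is better than the original (None if not computed).
    """
    if original_conf >= keep_threshold:
        return "keep"
    if candidate_better_conf is not None and candidate_better_conf > eta:
        return "replace"
    return "remove"


# Confidence values reported for the DROP-N examples in Figure 3.
print(curate_example(0.91, None))  # prints: keep
print(curate_example(0.41, 0.82))  # prints: replace
print(curate_example(0.03, 0.21))  # prints: remove
```

With these assumed settings, the values reported for SQuAD-N (Figure 4) and Email-N (Figure 5) lead to the same three outcomes.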