RECOST: External Knowledge Guided Data-efficient Instruction Tuning

Qi Zhang, Yiming Zhang, Haobo Wang, Junbo Zhao

TL;DR

This paper tackles the data-efficiency problem in instruction tuning for large language models when using synthetic instruction datasets. It introduces RECOST, a framework that leverages external knowledge through in-context learning to compute a relative predictive entropy and a diversity-aware core-set sampling strategy, enabling high-quality data selection with minimal data. Across Alpaca, Alpaca-gpt4, and OpenLLM benchmarks, RECOST consistently outperforms prior data-efficient methods and in some cases matches or surpasses full-data models using only 1% of the dataset, demonstrating the practical impact of incorporating exogenous knowledge into data curation. The approach underscores the value of combining external knowledge with diversity-focused sampling to improve data efficiency in instruction tuning, offering a scalable path to reduce training costs for LLM alignment.

Abstract

In the current landscape of large language models (LLMs), the process of instruction tuning serves as an essential step. Considering the high computing power overhead, data-efficient instruction tuning was proposed to reduce the training data size in this process, aiming at selecting high-quality instructional data. Nevertheless, we argue that most current data-efficient instruction-tuning methods are highly dependent on the quality of the original instruction-tuning dataset. When it comes to datasets synthesized by LLMs, a common scenario in this field, dirty samples will even be selected with a higher probability than other samples. To address these challenges, we utilized external knowledge (relevant examples or paragraphs) to evaluate those samples synthesized by LLMs with an in-context-based relative predictive entropy. Based on the new metric, we proposed a framework, dubbed RECOST, which integrates external-knowledge-base re-ranking and diversity-consistent sampling into a single pipeline. Through extensive experiments on several synthetic datasets (Alpaca and Alpaca-gpt4), we demonstrate the effectiveness of our method and achieve even better results with only 1% of the full dataset.

Paper Structure (32 sections, 4 equations, 6 figures, 7 tables, 1 algorithm)

Figures (6)

  • Figure 1: Dirty-data hit rates when the data are ranked by predictive entropy calculated by LLaMA-2-7b. The horizontal axis represents the percentage ranking, while the vertical axis denotes the proportion of corrupted data within the data preceding that percentage threshold. Given the number of dirty data points among the top $i$ data points as $d_i$, the hit rate at $i$ is calculated by $d_i/i$. The dirty data is collected by comparing Alpaca with Alpaca-cleaned.
  • Figure 2: Overview of our proposed method. We start by retrieving in-context knowledge for each candidate data point. The vanilla LLaMA model then produces two scores per data point, one conditioned on the in-context knowledge and one without it. The candidate data points are re-ranked using the two rankings induced by these scores. Finally, diversity-consistent sampling selects the qualified data points, which are used for supervised fine-tuning of the language model.
  • Figure 3: Results on the Alpagasus test set. The two subfigures (labeled fig:recost and fig:recost-gpt4 in the source) show RECOST's performance compared to models fully trained on Alpaca and Alpaca-gpt4, respectively.
  • Figure 4: Comparison of dirty data hit rates under different metrics on the Alpaca dataset calculated by LLaMA-2-7b. The horizontal axis represents the percentage ranking, while the vertical axis denotes the proportion of corrupted data within the data preceding that percentage threshold. Given the number of dirty data in the top $i$ data points as $d_i$, the hit rate at $i$ is calculated by $d_i/i$. The dirty data is collected by comparing Alpaca with Alpaca-cleaned.
  • Figure 5: Performance of relative predictive entropy as the mixing weight $w$ varies.
  • ...and 1 more figures
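
The hit rate used in the captions of Figures 1 and 4 ($d_i/i$, the fraction of dirty samples among the top $i$ ranked data points) can be illustrated with a minimal Python sketch. The function name `dirty_hit_rates`, the toy scores, and the `descending` flag are hypothetical and not taken from the paper; in particular, the captions do not state the ranking direction, so it is exposed here as a parameter rather than assumed.

```python
from typing import List, Sequence


def dirty_hit_rates(
    scores: Sequence[float],
    is_dirty: Sequence[bool],
    descending: bool = True,  # assumption: ranking direction is not given in the captions
) -> List[float]:
    """Return the hit rate d_i / i for i = 1..n after ranking samples by score.

    d_i is the number of dirty samples among the top-i ranked samples.
    """
    if len(scores) != len(is_dirty):
        raise ValueError("scores and is_dirty must have the same length")

    # Indices of the samples in ranked order.
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=descending)

    rates: List[float] = []
    dirty_so_far = 0
    for i, idx in enumerate(order, start=1):
        dirty_so_far += int(is_dirty[idx])  # running count d_i
        rates.append(dirty_so_far / i)      # hit rate at rank i
    return rates


# Toy example with made-up scores and labels (not data from the paper).
toy_scores = [0.9, 0.1, 0.5, 0.7]
toy_dirty = [True, False, False, True]
rates_desc = dirty_hit_rates(toy_scores, toy_dirty)
rates_asc = dirty_hit_rates(toy_scores, toy_dirty, descending=False)
```

In this toy case the two dirty samples have the highest scores, so the descending ranking places both of them first and the hit rate stays at 1.0 for the top two positions before dropping. A curve that stays high at small percentages means the metric tends to rank dirty samples early, which is the behavior the figures compare across metrics.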