D3: Diversity, Difficulty, and Dependability-Aware Data Selection for Sample-Efficient LLM Instruction Tuning

Jia Zhang, Chen-Xi Zhang, Yao Liu, Yi-Xuan Jin, Xiao-Wen Yang, Bo Zheng, Yi Liu, Lan-Zhe Guo

TL;DR

This work targets data-efficient instruction tuning for LLMs by automatically selecting a valuable data subset. It introduces D3, a three-criteria framework that ranks samples by diversity, difficulty (via uncertainty-aware UPD), and dependability (via a teacher model), and optimizes a weighted coreset through a greedy, multi-round process. Empirical results show that using only 5%–10% of data can achieve competitive or superior instruction-following performance across multiple public benchmarks and a real-world Taobao Live setting, demonstrating clear efficiency gains. The approach reduces data and compute needs while maintaining high effectiveness, with ablations confirming the necessity of jointly considering all three criteria. The work advances practical data selection for scalable, high-quality instruction tuning in real-world applications.

Abstract

Recent advancements in instruction tuning for large language models (LLMs) suggest that a small, high-quality dataset can significantly equip LLMs with instruction-following capabilities, outperforming large datasets often burdened by quality and redundancy issues. However, the challenge lies in automatically identifying valuable subsets from large datasets to boost both the effectiveness and efficiency of instruction tuning. In this paper, we first establish data selection criteria based on three distinct aspects of data value: diversity, difficulty, and dependability, and then propose the D3 method comprising two key steps of scoring and selection. Specifically, in the scoring step, we define the diversity function to measure sample distinctiveness and introduce the uncertainty-based prediction difficulty to evaluate sample difficulty by mitigating the interference of context-oriented generation diversity. Additionally, we integrate an external LLM for dependability assessment. In the selection step, we formulate the D3 weighted coreset objective, which jointly optimizes three aspects of data value to solve for the most valuable subset. The two steps of D3 can iterate multiple rounds, incorporating feedback to refine the selection focus adaptively. Experiments on both public datasets and the real-world Taobao Live application demonstrate the effectiveness of D3 in endowing LLMs with competitive or even superior instruction-following capabilities using less than 10% of the entire dataset.
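To make the selection step more concrete, the following is a minimal sketch of a greedy, weighted farthest-point heuristic. It is not the D3 objective from the paper: the function name `greedy_weighted_coreset`, the use of Euclidean distance between embeddings as the diversity term, and the product of difficulty and dependability as the sample weight are all assumptions made purely for illustration.

```python
import numpy as np

def greedy_weighted_coreset(embeddings, difficulty, dependability, budget):
    """Pick `budget` sample indices with a weighted farthest-point heuristic.

    Illustrative only: the weighting scheme is an assumption, not the
    objective defined in the D3 paper.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    # Assumed weight: difficult AND dependable samples count more.
    weights = np.asarray(difficulty, dtype=float) * np.asarray(dependability, dtype=float)
    budget = min(budget, len(embeddings))

    # Seed the subset with the highest-weight sample.
    selected = [int(np.argmax(weights))]
    # Distance from every sample to its nearest already-selected sample.
    min_dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)

    while len(selected) < budget:
        # Diversity (distance to the subset) scaled by the sample weight.
        gain = min_dist * weights
        gain[selected] = -np.inf  # never pick the same sample twice
        nxt = int(np.argmax(gain))
        selected.append(nxt)
        # Update nearest-selected distances with the newly added sample.
        new_dist = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected
```

In this sketch, a sample with dependability 0 has zero gain and is picked only after every other candidate, which mimics filtering out unreliable samples. A multi-round variant could recompute `difficulty` with the model tuned on the previous round's subset before calling the function again.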

Paper Structure

This paper contains 31 sections, 11 equations, 6 figures, 8 tables, 2 algorithms.

Figures (6)

  • Figure 1: The overall framework of the D3 data selection method, including the warm-up and the two key steps of scoring and selection.
  • Figure 2: Token-wise analysis reveals two causes for tokens that are difficult to predict: context-oriented generation diversity (CoGD) and the model’s weak instruction-following, which leads to confident but incorrect predictions. Instruction tuning should prioritize correcting poor instruction-following predictions rather than fitting difficult tokens caused by CoGD, as the latter risks reducing generation diversity and overfitting.
  • Figure 3: Performance variation across different selection budgets on Alpaca. Detailed results can be found in the appendix.
  • Figure 4: The performance comparisons of D3 and D3(MR), which selects data through multi-round scoring and selection.
  • Figure 5: The prompt template used to evaluate sample dependability.
  • ...and 1 more figures