Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models

Ziche Liu, Rui Ke, Yajiao Liu, Feng Jiang, Haizhou Li

TL;DR

This paper surveys data selection methods for instruction-tuning large language models, arguing that data quality is more critical than quantity. It introduces a three-stage framework (feature extraction, criteria design, and selector evaluation) and a unified comparison approach using ratio-based efficiency (PIR, SDF) and ranking-based feasibility to normalize cross-study evaluations. The analysis reveals that methods with targeted quality measurement achieve higher efficiency but often trade off feasibility and transferability, and it highlights four open challenges for future work. Overall, the framework provides a structured lens to compare data selectors and guides future research toward more objective, multi-goal data selection in instruction-tuning settings.

Abstract

Data selection for fine-tuning large language models (LLMs) aims to choose a high-quality subset from existing datasets, allowing the trained model to outperform baselines trained on the full dataset. However, the expanding body of research lacks a clear, unified framework, and the variability in experimental settings complicates systematic comparisons. While existing surveys comprehensively overview the stages and methods of data selection, they often overlook an in-depth exploration of the fine-tuning phase. In this paper, we conduct a focused review of recent data selection techniques for fine-tuning LLMs, analyzing a dozen key studies. We introduce a novel three-stage scheme - comprising feature extraction, criteria design, and selector evaluation - to systematically categorize and evaluate these methods. Additionally, we propose a unified comparison approach that incorporates ratio-based efficiency and ranking-based feasibility metrics to address inconsistencies across experiments. Our findings reveal that methods emphasizing more targeted quality measurement achieve higher efficiency but at the cost of feasibility. Finally, we discuss trends and highlight four key challenges in fine-tuning data selection, offering potential directions for future research.

Paper Structure (29 sections, 8 equations, 7 figures, 5 tables, 11 algorithms)

Figures (7)

  • Figure 1: An illustration of data selection for fine-tuning LLMs. Fine-tuning a model on the full dataset results in a BaseLine model (BLM), while training a model on a selected high-quality subset produces the Selective-Enhanced Model (SEM), which is expected to outperform the BLM.
  • Figure 2: The Three-stage Scheme of Data Selection for Fine-tuning LLMs. The feature extraction stage transforms the raw data into compact representations to facilitate selection. The criteria design stage constructs quality labels (QL) to capture data quality as selection criteria. The selector evaluation stage identifies the key components to evaluate the effectiveness of the selector.
  • Figure 3: The unified comparison framework, in which IFD and DEITA are chosen for clarity and ease of understanding. The design indicator stage calculates statistics based on papers' reported data, such as Performance Improvement Ratio (PIR) and Selected Dataset Fraction (SDF). The establish relation stage visualizes the interplay between these indicators through graphs. Finally, the derive comparison stage draws conclusions from the graphs.
  • Figure 4: Efficiency comparison of popular data selection methods. The yellow dashed line represents the baseline efficiency $l_{\text{base}}$, with the vertical distance from each method to the baseline indicating its relative efficiency difference.
  • Figure 5: Simplicity and flexibility rank of data selection methods. The yellow line $l_{\text{ranking}}$ is the ranking line, with the projection of each method onto the line indicating its feasibility rank.
  • ...and 2 more figures
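
The two indicators named in the Figure 3 caption can be illustrated with a small sketch. The captions give only the names, not the formulas, so the definitions below are assumptions: SDF is taken as the selected subset size divided by the full dataset size, and PIR as the SEM score divided by the BLM score. The `SelectionReport` class, its field names, and the numbers are hypothetical and do not come from any of the surveyed papers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectionReport:
    """Numbers a paper might report for one selector (hypothetical fields)."""
    name: str
    full_size: int       # examples in the full dataset (used to train the BLM)
    selected_size: int   # examples kept by the selector (used to train the SEM)
    blm_score: float     # benchmark score of the baseline model
    sem_score: float     # benchmark score of the selective-enhanced model


def sdf(report: SelectionReport) -> float:
    """Selected Dataset Fraction, ASSUMED to be |subset| / |full dataset|."""
    if report.full_size <= 0:
        raise ValueError("full_size must be positive")
    return report.selected_size / report.full_size


def pir(report: SelectionReport) -> float:
    """Performance Improvement Ratio, ASSUMED to be SEM score / BLM score."""
    if report.blm_score <= 0:
        raise ValueError("blm_score must be positive")
    return report.sem_score / report.blm_score


# Made-up numbers for illustration only.
example = SelectionReport(
    name="selector_a",
    full_size=52_000,
    selected_size=2_600,
    blm_score=60.0,
    sem_score=63.0,
)

if __name__ == "__main__":
    print(f"{example.name}: SDF={sdf(example):.3f}, PIR={pir(example):.3f}")
```

Under these assumed definitions, a PIR above 1 paired with a small SDF would mean the subset-trained model beats the full-data baseline while using a fraction of the data, which is the relationship the efficiency comparison in Figure 4 is described as visualizing.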