Improving Data Efficiency via Curating LLM-Driven Rating Systems

Jinlong Pang, Jiaheng Wei, Ankit Parag Shah, Zhaowei Zhu, Yaxuan Wang, Chen Qian, Yang Liu, Yujia Bao, Wei Wei

TL;DR

This work tackles data efficiency in instruction tuning, where adding more raw data does not guarantee better performance and LLM-based quality scores are often noisy. It introduces DS$^2$, a diversity-aware score curation pipeline that models rating errors with a score transition matrix $T$ and uses a $k$-NN consensus framework to correct the scores while preserving long-tail diversity. In OpenLLM leaderboard experiments across multiple base models, DS$^2$ shows that a small, high-quality subset (as little as 3.3% of the original data) can outperform the full data pool and rival human-aligned datasets such as LIMA at comparable sizes. The findings challenge traditional data-scaling laws: redundant and low-quality samples can impede learning, and curated selection offers a cost-effective alternative to large-scale data collection and human annotation for model alignment.

Abstract

Instruction tuning is critical for adapting large language models (LLMs) to downstream tasks, and recent studies have demonstrated that small amounts of human-curated data can outperform larger datasets, challenging traditional data scaling laws. While LLM-based data quality rating systems offer a cost-effective alternative to human annotation, they often suffer from inaccuracies and biases, even in powerful models like GPT-4. In this work, we introduce DS$^2$, a Diversity-aware Score curation method for Data Selection. By systematically modeling error patterns through a score transition matrix, DS$^2$ corrects LLM-based scores and promotes diversity in the selected data samples. Our approach shows that a curated subset (just 3.3% of the original dataset) outperforms full-scale datasets (300k samples) across various machine-alignment benchmarks, and matches or surpasses human-aligned datasets such as LIMA with the same sample size (1k samples). These findings challenge conventional data scaling assumptions, highlighting that redundant, low-quality samples can degrade performance and reaffirming that "more can be less."

Paper Structure

This paper contains 64 sections, 5 equations, 19 figures, 20 tables, 1 algorithm.

Figures (19)

  • Figure 1: Illustration of the DS$^2$ data selection pipeline. Step 1 uses LLMs to rate data samples. Step 2 estimates a score transition matrix $\boldsymbol{T}$ from $k$-Nearest Neighbor ($k$-NN) statistical information (without relying on ground-truth quality scores), then curates the scores. Step 3 calculates a long-tail score for rare-data selection. Final data selection relies on the curated scores and the long-tail distribution to prioritize quality while maintaining diversity.
  • Figure 2: Comparison of score distributions across different rating models.
  • Figure 3: Comparison of score transition matrices across different rating models.
  • Figure 4: Examples with high and low long-tail scores.
  • Figure 5: Data scaling efforts of baselines across various rating models. Base model: LLaMA-3.1-8B. The y-axis shows OpenLLM leaderboard performance; the x-axis shows the number of samples used.
  • ...and 14 more figures
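
The three steps in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`knn_indices`, `curate_scores`, `long_tail_scores`, `select`) and the `min_agree` threshold are hypothetical, a plain neighbour majority vote stands in for the transition-matrix-based curation, and the long-tail score is assumed to be one minus the mean cosine similarity to the $k$ nearest neighbours.

```python
import numpy as np


def knn_indices(emb, k):
    """Return (neighbour indices, cosine-similarity matrix), excluding self-matches."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)          # a sample is never its own neighbour
    nbrs = np.argsort(-sim, axis=1)[:, :k]  # k most similar samples per row
    return nbrs, sim


def curate_scores(scores, nbrs, min_agree=0.6):
    """Assumed rule: overwrite a score only if >= min_agree of the neighbours agree."""
    curated = scores.copy()
    for i, row in enumerate(nbrs):
        counts = np.bincount(scores[row])   # votes use the original, uncurated scores
        top = counts.argmax()
        if counts[top] / len(row) >= min_agree:
            curated[i] = top
    return curated


def long_tail_scores(sim, nbrs):
    """Assumed proxy: 1 - mean cosine similarity to the k neighbours (higher = rarer)."""
    return 1.0 - np.take_along_axis(sim, nbrs, axis=1).mean(axis=1)


def select(curated, long_tail, n):
    """Rank by curated score (descending), breaking ties by long-tail score (descending)."""
    order = np.lexsort((-long_tail, -curated))  # the last key is the primary sort key
    return order[:n]


# Toy data: two tight clusters; sample 2 carries a suspiciously low rating.
emb = np.array([[1.0, 0.0], [1.0, 0.1], [1.0, -0.1],
                [0.0, 1.0], [0.1, 1.0], [-0.1, 1.0]])
scores = np.array([5, 5, 1, 2, 2, 2])       # integer ratings, as an LLM rater might emit

nbrs, sim = knn_indices(emb, k=2)
curated = curate_scores(scores, nbrs)
picked = select(curated, long_tail_scores(sim, nbrs), n=2)
```

In the toy data, sample 2 sits among highly rated neighbours, so the vote raises its score, while samples whose neighbours disagree keep their original rating. Ranking by curated score first and long-tail score second is one simple way to "prioritize quality while maintaining diversity"; the paper's actual selection rule may differ.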

Theorems & Definitions (2)

  • Definition 3.1: score transition matrix
  • Definition 3.2: $k$-NN score clusterability
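
For orientation, a score transition matrix is usually written in the label-noise literature in the form below. The notation is an assumption for illustration and is not quoted from the paper: $K$ is the number of score levels, $y$ the unobserved true score, $\tilde{y}$ the LLM-rated score, and $\boldsymbol{x}$ the sample.

$$
T_{i,j}(\boldsymbol{x}) = \mathbb{P}\left(\tilde{y} = j \mid y = i, \boldsymbol{x}\right), \qquad i, j \in \{1, \dots, K\}
$$

Under this reading, a perfectly accurate rater has $\boldsymbol{T}$ equal to the identity matrix, and off-diagonal mass measures how often a true score $i$ is reported as $j$. $k$-NN score clusterability would then be the assumption that a sample and its $k$ nearest neighbours in embedding space share the same true score, which is what lets $\boldsymbol{T}$ be estimated from neighbour agreement without ground-truth scores.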