Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation

Yuan Ge; Yilun Liu; Chi Hu; Weibin Meng; Shimin Tao; Xiaofeng Zhao; Hongxia Ma; Li Zhang; Boxing Chen; Hao Yang; Bei Li; Tong Xiao; Jingbo Zhu

TL;DR

This paper proposes Clustering and Ranking (CaR), an industrial-friendly, expert-aligned and diversity-preserved instruction data selection method. CaR uses small models and incurs only 11.2% of the monetary cost of existing methods, making it easy to deploy in industrial scenarios.

Abstract

With contributions from the open-source community, a vast amount of instruction tuning (IT) data has emerged. Because training and evaluating models requires significant resources, an efficient method for selecting high-quality IT data is valuable. However, existing methods for instruction data selection have limitations: they rely on fragile external APIs, are affected by biases in GPT models, or reduce the diversity of the selected instruction dataset. In this paper, we propose an industrial-friendly, expert-aligned and diversity-preserved instruction data selection method: Clustering and Ranking (CaR). CaR employs a two-step process: first, it ranks instruction pairs using a high-accuracy (84.25%) scoring model aligned with expert preferences; second, it preserves dataset diversity through clustering. In our experiment, CaR selected a mere 1.96% of Alpaca's IT data, yet the resulting AlpaCaR model surpassed Alpaca's performance by an average of 32.1% in GPT-4 evaluations. Moreover, we find that data selection is a consistent paradigm, holding both when the pre-trained model is more capable and when the model parameters are scaled up. Our approach employs compact models with 550M parameters and incurs just 11.2% of the monetary cost of current methods, which makes it easier to deploy in industry.

Paper Structure (48 sections, 13 figures, 8 tables)

Introduction
Method
Motivation
From Quality Estimation to Instruction Pair Quality Estimation.
GPT as a Judge Exhibits Systematic Bias.
Instruction Diversity Inspires LLMs' Multi-tasks Capability.
Clustering and Ranking Method
Single Instruction Pair Quality Estimation
Diversity
Experimental Setup
Test Datasets
Generations
Evaluate Metrics
Results and Analysis
Comparison with Baselines
...and 33 more sections

Figures (13)

Figure 1: Comparison of the proposed AlpaCaR model with established baseline models on four test sets. AlpaCaR achieves the best model performance with the smallest amount of instruction tuning data.
Figure 2: An overview of the Clustering and Ranking (CaR) method. Unlike directly training Alpaca with the entire Alpaca_52k dataset, CaR first uses the IQS model to score all instructions (brown arrow). Then it selects the top $n_1$ instructions ranked by quality. Next, a clustering model (violet arrow) groups all instructions into $k$ clusters, selecting $n_2$ from each. These are concatenated and deduplicated to form a diverse, high-quality sub-dataset for training AlpaCaR.
Figure 3: Consistency between IQS scores and the performance of LLMs.
Figure 4: Model performances with varying $n_1$.
Figure 5: Performances with varying $n_2$.
...and 8 more figures
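
The selection procedure described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name car_select and its arguments are hypothetical, the quality scores and cluster assignments are assumed to have been computed beforehand (by the IQS model and a clustering model, respectively), and picking the highest-scoring n2 instructions within each cluster is an assumption, since the caption only states that n2 are selected from each cluster.

```python
from collections import defaultdict


def car_select(scores, cluster_ids, n1, n2):
    """Return the sorted indices kept by a CaR-style selection.

    scores      -- one quality score per instruction pair (higher is better)
    cluster_ids -- one cluster label per instruction pair
    n1          -- number of top-ranked instructions kept overall
    n2          -- number of instructions kept from every cluster
    """
    if len(scores) != len(cluster_ids):
        raise ValueError("scores and cluster_ids must have the same length")

    # Rank every instruction by quality, best first (ties keep input order).
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])

    # Ranking step: the n1 best instructions overall.
    top_overall = ranked[:n1]

    # Diversity step: walk the ranking so each cluster fills up with its
    # best members first (assumption: "best" means highest score).
    per_cluster = defaultdict(list)
    for i in ranked:
        if len(per_cluster[cluster_ids[i]]) < n2:
            per_cluster[cluster_ids[i]].append(i)

    # Concatenate both selections and deduplicate via a set.
    selected = set(top_overall)
    for members in per_cluster.values():
        selected.update(members)
    return sorted(selected)


# Toy data: six instructions spread over three clusters.
example_scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]
example_clusters = [0, 0, 0, 1, 1, 2]
print(car_select(example_scores, example_clusters, n1=2, n2=1))  # [0, 2, 4, 5]
```

In the toy run, the ranking step alone would keep only indices 0 and 2, both from cluster 0; the per-cluster step adds indices 4 and 5 so that clusters 1 and 2 remain represented, which is the diversity-preserving effect the caption describes.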
