Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation

Yuan Ge, Yilun Liu, Chi Hu, Weibin Meng, Shimin Tao, Xiaofeng Zhao, Hongxia Ma, Li Zhang, Boxing Chen, Hao Yang, Bei Li, Tong Xiao, Jingbo Zhu

TL;DR

This paper proposes an industrial-friendly, expert-aligned and diversity-preserved instruction data selection method: Clustering and Ranking (CaR). CaR uses small models and requires only 11.2% of the monetary cost of existing methods, making it easy to deploy in industrial scenarios.

Abstract

With contributions from the open-source community, a vast amount of instruction tuning (IT) data has emerged. Given the significant resources required for training and evaluating models, an efficient method for selecting high-quality IT data is valuable. However, existing methods for instruction data selection have limitations such as relying on fragile external APIs, being affected by biases in GPT models, or reducing the diversity of the selected instruction dataset. In this paper, we propose an industrial-friendly, expert-aligned and diversity-preserved instruction data selection method: Clustering and Ranking (CaR). CaR employs a two-step process: first, it ranks instruction pairs using a high-accuracy (84.25%) scoring model aligned with expert preferences; second, it preserves dataset diversity through clustering. In our experiment, CaR selected a mere 1.96% of Alpaca's IT data, yet the resulting AlpaCaR model surpassed Alpaca's performance by an average of 32.1% in GPT-4 evaluations. Moreover, we find that data selection remains effective both when the pre-trained model is more capable and when the model parameters are scaled up. Our approach employs compact models with 550M parameters and incurs just 11.2% of the monetary cost of existing methods, enhancing its industrial deployability.

Paper Structure (48 sections, 13 figures, 8 tables)

Figures (13)

  • Figure 1: Compares the performance of the proposed AlpaCaR model to established baseline models over four test sets. Our AlpaCaR achieves the best model performance with the smallest amount of instruction tuning data.
  • Figure 2: An overview of the Clustering and Ranking (CaR) method. Rather than training Alpaca directly on the entire Alpaca_52k dataset, CaR first uses the IQS model to score all instructions (brown arrow) and selects the top $n_1$ instructions ranked by quality. Next, a clustering model (violet arrow) groups all instructions into $k$ clusters, and $n_2$ instructions are selected from each. The two selections are concatenated and deduplicated to form a diverse, high-quality sub-dataset for training AlpaCaR.
  • Figure 3: Consistency between IQS scores and the performance of LLMs.
  • Figure 4: Model performances with varying $n_1$.
  • Figure 5: Performances with varying $n_2$.
  • ...and 8 more figures
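
The selection step described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `car_select` is hypothetical, the quality scores and cluster assignments are assumed to be precomputed (by the IQS model and the clustering model, respectively), and the $n_2$ instructions taken from each cluster are assumed to be that cluster's highest-scored ones, which the caption does not state explicitly.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def car_select(
    scores: Sequence[float],
    cluster_ids: Sequence[int],
    n1: int,
    n2: int,
) -> List[int]:
    """Return indices of selected instructions (hypothetical helper).

    scores[i]      -- precomputed quality score of instruction i
    cluster_ids[i] -- precomputed cluster label of instruction i
    """
    if len(scores) != len(cluster_ids):
        raise ValueError("scores and cluster_ids must have the same length")

    # Rank all instructions by quality, best first (ties broken by index).
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))

    # Step 1: keep the top n1 instructions overall.
    top_overall = ranked[:n1]

    # Step 2 (assumption): keep the n2 best-scored instructions per cluster.
    per_cluster: Dict[int, List[int]] = defaultdict(list)
    for i in ranked:  # iterating in ranked order keeps each list score-sorted
        if len(per_cluster[cluster_ids[i]]) < n2:
            per_cluster[cluster_ids[i]].append(i)

    # Step 3: concatenate both selections and deduplicate, keeping order.
    candidates = top_overall + [
        i for label in sorted(per_cluster) for i in per_cluster[label]
    ]
    selected: List[int] = []
    seen = set()
    for i in candidates:
        if i not in seen:
            seen.add(i)
            selected.append(i)
    return selected
```

Because the two selections can overlap, the result contains at most $n_1 + k \cdot n_2$ instructions. For example, with scores `[0.9, 0.8, 0.1, 0.2, 0.7, 0.3]`, cluster labels `[0, 0, 1, 1, 0, 2]`, `n1=2` and `n2=1`, the sketch keeps the two best instructions overall plus the best instruction of each cluster, so the low-scored clusters 1 and 2 remain represented.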