TACOS: Open Tagging and Comparative Scoring for Instruction Fine-Tuning Data Selection

Xixiang He, Hao Yu, Qiyao Sun, Ao Cheng, Tailai Zhang, Cong Liu, Shuxuan Guo

TL;DR

The paper addresses the data-selection bottleneck in instruction fine-tuning by introducing TACOS, a two-part framework that enhances both data diversity and scoring reliability. Open Tagging leverages LLMs to generate open-domain tags, followed by normalization and clustering to preserve diverse instructional intents with compact representations. Comparative Scoring refines evaluation prompts and uses pairwise, within-cluster comparisons on a $[1,100]$ scale to align assessments with human criteria and reduce scoring bias. Across multiple LLM architectures and benchmarks (e.g., MT-Bench, AlpacaEval 2.0), TACOS delivers consistent improvements over state-of-the-art baselines and can significantly accelerate IFT training, with ablation studies confirming the contributions of each component. The work advances practical IFT data selection, providing open-source code and data for reproducibility and broader adoption.

Abstract

Instruction Fine-Tuning (IFT) is crucial for aligning large language models (LLMs) with human preferences, and selecting a small yet representative subset from massive data makes IFT both more efficient and more effective. Nevertheless, existing approaches suffer from two limitations: simple heuristics restrict data diversity, and singleton quality evaluation, which scores each sample independently, applies inconsistent criteria across samples. To address these issues, we present TACOS, a method that integrates Open Tagging and Comparative Scoring for IFT data selection. To capture data diversity, we leverage LLMs to assign open-domain tags to human queries, followed by a normalization stage that denoises the open tags and enables efficient clustering. Additionally, we propose a comparative scoring method that evaluates the relative quality of samples within a cluster, avoiding the inconsistent criteria seen in singleton-based evaluations. Extensive experiments across diverse datasets and LLM architectures demonstrate that TACOS outperforms existing approaches by a large margin. Notably, it achieves superior instruction-following performance on MT-Bench and ranks 1st among LLaMA2-7B-based models on AlpacaEval 2.0, illustrating its efficacy for IFT data selection.
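
To make the Open Tagging stage more concrete, the following is a minimal sketch of tag normalization followed by grouping. It is not the authors' implementation: the abstract does not spell out the normalization rules or the clustering algorithm, so the lowercasing, punctuation stripping, frequency threshold (`min_count`), and the grouping by most frequent surviving tag are all assumptions chosen for illustration, as are the function names.

```python
import re
from collections import Counter, defaultdict


def normalize_tag(tag: str) -> str:
    """Lowercase a tag and turn punctuation/underscores into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", tag.lower()).strip()


def normalize_tags(sample_tags, min_count=2):
    """Normalize every tag, then drop tags seen in fewer than min_count samples.

    Both steps are assumed denoising rules, not the paper's exact procedure.
    """
    normalized = [[normalize_tag(t) for t in tags] for tags in sample_tags]
    normalized = [[t for t in tags if t] for tags in normalized]
    counts = Counter(t for tags in normalized for t in set(tags))
    return [[t for t in tags if counts[t] >= min_count] for tags in normalized]


def group_by_primary_tag(normalized):
    """Group sample indices by their most frequent surviving tag.

    This is a simple stand-in for clustering; samples with no tag left
    fall into an "untagged" group.
    """
    counts = Counter(t for tags in normalized for t in set(tags))
    clusters = defaultdict(list)
    for idx, tags in enumerate(normalized):
        key = max(tags, key=lambda t: (counts[t], t)) if tags else "untagged"
        clusters[key].append(idx)
    return dict(clusters)


# Hypothetical LLM-generated tags for five queries.
raw_tags = [
    ["Code Generation", "Python"],
    ["code-generation"],
    ["Creative_Writing", "poetry"],
    ["creative writing"],
    ["Rare Tag"],
]
clusters = group_by_primary_tag(normalize_tags(raw_tags))
print(clusters)
```

In this toy input, surface variants such as `Code Generation` and `code-generation` collapse to one tag, which is the kind of compression the normalization stage is meant to achieve before clustering.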

Paper Structure

This paper contains 15 sections, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Top: Comparison of different IFT data selection strategies: (a) IFT on the original data, which is time-consuming and expensive and leads to suboptimal performance; (b) IFT on selected data, which saves training time and expense while improving IFT performance; (c) TACOS, which introduces Open Tagging and Comparative Scoring to further boost the performance of data selection for IFT. Bottom: Quantitative results: (d) comparison between TACOS and a SOTA baseline for IFT data selection in terms of LLM performance, where the consistently higher win rate of TACOS across various LLMs demonstrates its advantage; (e) comparison between TACOS-based IFT and IFT on the original data in terms of LLM fine-tuning time, where TACOS achieves a 12x acceleration.
  • Figure 2: An overview of TACOS. Top: (a) Open Tagging. From left to right, an LLM is leveraged to generate open-domain tags for IFT datasets, followed by normalization and clustering to group similar samples, ensuring both data diversity and efficiency. Bottom: (b) Comparative Scoring. From right to left, an LLM performs comparative scoring within each cluster to obtain consistent criteria and reliable scores for high-quality IFT data selection.
  • Figure 3: Template of Comparative Scoring Prompt. The LLM evaluates pairwise data in a comparative fashion, scoring each on a range of [1, 100]. Feedback ensures accurate assessment.
  • Figure 4: Preference evaluation results (in %). Rows represent five test sets, and columns represent four baseline methods. The results show the win, tie, and lose rates of TACOS versus the baselines. Top: comparisons on the Alpaca-52k dataset with LLaMA2-7B. Bottom: comparisons on the Evol-Instruct-70k dataset with LLaMA2-7B. The results demonstrate that our approach consistently yields higher preference scores than existing methods.
  • Figure 5: Distribution of tags before and after normalization. The introduced normalization procedures compress the original tag set from around 50k tags to fewer than 6k.
  • ...and 1 more figure
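
The Comparative Scoring idea described for Figures 2 and 3 (pairwise evaluation within a cluster, each sample scored in [1, 100]) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's procedure: `toy_judge` is a hypothetical stand-in for the LLM call and scores by response length only, comparing every pair in a cluster and averaging the scores each sample receives is an assumed aggregation rule, and `select_top_k` is an assumed selection step.

```python
from itertools import combinations
from statistics import mean


def clamp(score, low=1, high=100):
    """Keep a score inside the [1, 100] range used by the scoring prompt."""
    return max(low, min(high, score))


def toy_judge(a, b):
    """Hypothetical stand-in for an LLM judge: scores a pair by response length."""
    return clamp(len(a["response"])), clamp(len(b["response"]))


def score_cluster(samples, judge):
    """Compare every pair in one cluster and average each sample's scores.

    All-pairs comparison and mean aggregation are assumptions.
    """
    received = {i: [] for i in range(len(samples))}
    for i, j in combinations(range(len(samples)), 2):
        score_i, score_j = judge(samples[i], samples[j])
        received[i].append(score_i)
        received[j].append(score_j)
    return {i: mean(v) if v else 0.0 for i, v in received.items()}


def select_top_k(samples, judge, k):
    """Return indices of the k highest-scoring samples in the cluster."""
    if len(samples) <= k:
        return list(range(len(samples)))
    scores = score_cluster(samples, judge)
    return sorted(scores, key=lambda i: (-scores[i], i))[:k]


# One hypothetical cluster of three samples.
cluster = [
    {"response": "ok"},
    {"response": "a longer, more detailed answer"},
    {"response": "medium answer"},
]
print(select_top_k(cluster, toy_judge, k=2))
```

Because both samples in a pair are scored in the same judge call, their scores share one set of criteria, which is the property that singleton scoring lacks. In practice the judge would wrap a prompt like the one in Figure 3.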