T-SHIRT: Token-Selective Hierarchical Data Selection for Instruction Tuning

Yanjun Fu, Faisal Hamman, Sanghamitra Dutta

TL;DR

This work tackles data efficiency in instruction tuning by introducing token-level quality assessment and robustness-aware data selection. It defines Selective IFD (S-IFD), which scores a sample using only its informative tokens, and employs a hierarchical neighborhood-based strategy to pick samples whose quality stays consistently high under perturbation. Empirical results show T-SHIRT surpasses baselines across multiple datasets and benchmarks using as little as 5% of the data, with modest runtime on a single GPU (about 40 minutes for 52k samples when GPT-2 computes the scores). The method remains cost-effective, avoiding API-based scoring, and scales across models and datasets, highlighting the importance of fine-grained, robust data selection in instruction-tuning pipelines.

Abstract

Instruction tuning is essential for Large Language Models (LLMs) to effectively follow user instructions. To improve training efficiency and reduce data redundancy, recent works use LLM-based scoring functions, e.g., Instruction-Following Difficulty (IFD), to select high-quality instruction-tuning data with scores above a threshold. While these data selection methods often lead to models that can match or even exceed the performance of models trained on the full datasets, we identify two key limitations: (i) they assess quality at the sample level, ignoring token-level informativeness; and (ii) they overlook the robustness of the scoring method, often selecting a sample due to superficial lexical features instead of its true quality. In this work, we propose Token-Selective HIeRarchical Data Selection for Instruction Tuning (T-SHIRT), a novel data selection framework that introduces a new scoring method to include only informative tokens in quality evaluation and also promotes robust and reliable samples whose neighbors also show high quality with fewer local inconsistencies. We demonstrate that models instruction-tuned on a curated dataset (only 5% of the original size) using T-SHIRT can outperform those trained on the entire large-scale dataset by up to 5.48 points on average across eight benchmarks. Across various LLMs and training set scales, our method consistently surpasses existing state-of-the-art data selection techniques, while also remaining both cost-effective and highly efficient. For instance, by using GPT-2 for score computation, we are able to process a dataset of 52k samples in 40 minutes on a single GPU. Our code is available at https://github.com/Dynamite321/T-SHIRT.
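
The robustness idea in the abstract (prefer samples whose neighbors also score well, with few local inconsistencies) can be illustrated with a minimal sketch. It assumes each sample already has quality scores for several perturbed neighbors, and it uses a two-stage rule: first keep the candidates with the highest mean neighbor score, then keep the lowest-variance ones among them. The function name `hierarchical_select`, the `pool_factor` parameter, and the exact staging are hypothetical choices made for illustration; they are not taken from the paper or its repository.

```python
import numpy as np


def hierarchical_select(neighbor_scores, budget, pool_factor=2.0):
    """Pick `budget` sample indices with high mean and low variance of neighbor scores.

    neighbor_scores: array-like of shape (n_samples, n_neighbors); each row holds the
        quality scores of one sample's perturbed neighbors (assumed precomputed).
    budget: number of samples to keep.
    pool_factor: hypothetical knob; stage 1 keeps budget * pool_factor candidates.
    """
    scores = np.asarray(neighbor_scores, dtype=float)
    mean = scores.mean(axis=1)
    var = scores.var(axis=1)

    # Stage 1: candidate pool made of the samples with the highest mean neighbor score.
    pool_size = min(len(scores), int(np.ceil(budget * pool_factor)))
    pool = np.argsort(-mean, kind="stable")[:pool_size]

    # Stage 2: inside the pool, keep the samples whose neighbor scores vary the least.
    keep = pool[np.argsort(var[pool], kind="stable")[:budget]]
    return np.sort(keep)


# Five samples with three perturbed neighbors each (made-up numbers).
example_scores = [
    [1.0, 1.0, 1.0],  # high mean, stable
    [0.5, 1.5, 1.0],  # high mean, somewhat unstable
    [0.2, 0.2, 0.2],  # stable but low mean
    [2.0, 0.0, 1.0],  # high mean, very unstable
    [0.5, 0.5, 0.5],  # moderate mean, stable
]
selected = hierarchical_select(example_scores, budget=2)
```

In this toy input, a plain threshold on the mean score could not separate the first, second, and fourth samples, since all three have a mean of 1.0; the variance stage is what rejects the unstable ones.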

Paper Structure

This paper contains 30 sections, 4 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: An overview of our approach, T-SHIRT. $W_E$ denotes the embedding layer of the model (e.g., GPT-2) used to compute S-IFD scores. For each instruction-response pair, we generate its neighbors by perturbing token embeddings. Then we only use selected response tokens (green squares) for S-IFD computation, while excluded tokens are marked as red squares. Finally, we use hierarchical selection to choose samples whose neighbors exhibit high average S-IFD and low variance.
  • Figure 2: Two examples from the Alpaca-GPT-4 dataset with nearly identical IFD scores but markedly different $\text{S-IFD}_{75}$ scores. Top: Instructions and partial responses from two examples, with tokens highlighted where $|\Delta_t| \le 0.01$. Bottom: Plots of $|\Delta_t|$ values corresponding to the same examples. Tokens highlighted above are marked as red dots in the plots.
  • Figure 3: Sensitivity of IFD and $\text{S-IFD}_{75}$ scores to a semantics-preserving word substitution. Top: An instruction-response pair with a blank, filled with either "average" or "mean". Bottom: Plot of $|\Delta_t|$ values for each variant.
  • Figure 4: Illustration of the selection policy $\pi$. Each circle represents an instruction-response pair in the embedding space. Existing methods distinguish between high- and low-quality training data using a fixed threshold.
  • Figure 5: CDF of $|\Delta_t|$ (\ref{eq:rewrite_ifd}) for response tokens in Alpaca-GPT-4, computed using GPT-2. The red dashed line indicates $|\Delta_t| = 0.01$.
  • ...and 2 more figures
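
The captions above refer to per-token values $|\Delta_t|$ and to $\text{S-IFD}_{75}$ without defining them on this page. The sketch below shows one way such a token-selective score could be computed, under stated assumptions: $\Delta_t$ is taken to be the difference between a response token's log-probability with and without the instruction (which makes the usual IFD perplexity ratio equal to $\exp(-\text{mean}_t\,\Delta_t)$), and $\text{S-IFD}_k$ is taken to average only over the $k\%$ of response tokens with the largest $|\Delta_t|$. Both definitions, the function name `selective_ifd`, and the `keep_percent` parameter are assumptions made for illustration and should be checked against the paper.

```python
import numpy as np


def selective_ifd(logp_with_instruction, logp_without_instruction, keep_percent=75.0):
    """Token-selective variant of an IFD-style score (illustrative assumption).

    logp_with_instruction: per-token log p(y_t | x, y_<t) for the response tokens.
    logp_without_instruction: per-token log p(y_t | y_<t) for the same tokens.
    keep_percent: share of tokens, ranked by |delta_t|, that enter the score.
    """
    cond = np.asarray(logp_with_instruction, dtype=float)
    uncond = np.asarray(logp_without_instruction, dtype=float)

    # Assumed definition: how much the instruction changes each token's log-probability.
    delta = cond - uncond

    # Keep the tokens the instruction affects most; tokens with |delta_t| near zero drop out.
    n_keep = max(1, int(round(len(delta) * keep_percent / 100.0)))
    kept = np.argsort(-np.abs(delta), kind="stable")[:n_keep]

    # With keep_percent=100 this is the perplexity ratio PPL(y | x) / PPL(y).
    return float(np.exp(-delta[kept].mean()))


# Four response tokens (made-up log-probabilities); two are unaffected by the instruction.
cond_logp = [-1.0, -2.0, -0.5, -3.0]
uncond_logp = [-2.0, -2.0, -0.5, -1.0]
full_score = selective_ifd(cond_logp, uncond_logp, keep_percent=100.0)
selective_score = selective_ifd(cond_logp, uncond_logp, keep_percent=50.0)
```

Because the two unaffected tokens contribute $\Delta_t = 0$, they pull the full-sequence average toward zero; dropping them changes the score even though the informative tokens are identical, which is the kind of gap Figure 2 illustrates.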