A Critical Look at Targeted Instruction Selection: Disentangling What Matters (and What Doesn't)

Nihal V. Nayak, Paula Rodriguez-Diaz, Neha Hulkund, Sara Beery, David Alvarez-Melis

TL;DR

The paper finds that only gradient-based data representations choose subsets whose similarity to the query consistently predicts performance across datasets and models. It also unifies several existing selection algorithms as forms of approximate distance minimization between the selected subset and the query set.

Abstract

Instruction fine-tuning of large language models (LLMs) often involves selecting a subset of instruction training data from a large candidate pool, using a small query set from the target task. Despite growing interest, the literature on targeted instruction selection remains fragmented and opaque: methods vary widely in selection budgets, often omit zero-shot baselines, and frequently entangle the contributions of key components. As a result, practitioners lack actionable guidance on selecting instructions for their target tasks. In this work, we aim to bring clarity to this landscape by disentangling and systematically analyzing the two core ingredients: data representation and selection algorithms. Our framework enables controlled comparisons across models, tasks, and budgets. We find that only gradient-based data representations choose subsets whose similarity to the query consistently predicts performance across datasets and models. While no single method dominates, gradient-based representations paired with a greedy round-robin selection algorithm tend to perform best on average at low budgets, but these benefits diminish at larger budgets. Finally, we unify several existing selection algorithms as forms of approximate distance minimization between the selected subset and the query set, and support this view with new generalization bounds. More broadly, our findings provide critical insights and a foundation for more principled data selection in LLM fine-tuning. The code is available at https://github.com/dcml-lab/targeted-instruction-selection.
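The greedy round-robin selection mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the procedure, so this rests on assumptions: similarity is cosine similarity between representation vectors, and on each turn a query example picks its most similar candidate that has not yet been selected. The names `cosine` and `greedy_round_robin` are hypothetical and are not taken from the authors' code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_round_robin(query_reps, cand_reps, budget):
    """Return up to `budget` candidate indices, in the order they were picked.

    Assumption (not confirmed by the source): queries take turns, and on its
    turn each query picks its most similar candidate not selected so far.
    """
    if not query_reps:
        return []
    budget = min(budget, len(cand_reps))

    # For every query, rank all candidates from most to least similar.
    rankings = [
        sorted(range(len(cand_reps)), key=lambda j: -cosine(q, cand_reps[j]))
        for q in query_reps
    ]
    cursors = [0] * len(query_reps)  # next position to inspect in each ranking
    selected, seen = [], set()

    while len(selected) < budget:
        for qi, ranking in enumerate(rankings):
            if len(selected) == budget:
                break
            # Skip candidates that another query has already taken.
            while cursors[qi] < len(ranking) and ranking[cursors[qi]] in seen:
                cursors[qi] += 1
            if cursors[qi] < len(ranking):
                j = ranking[cursors[qi]]
                seen.add(j)
                selected.append(j)
    return selected
```

With gradient-based representations, `query_reps` and `cand_reps` would hold (projected) gradient features rather than embeddings; the selection loop itself does not depend on how the vectors were produced, which is the separation between representation and selection algorithm that the abstract describes.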

Paper Structure (70 sections, 5 theorems, 27 equations, 18 figures, 3 tables, 1 algorithm)


Key Result

Theorem 6.1

Let $\ell:\Theta\times\mathcal{Z}\to\mathbb R_+$ be a loss function, where $\mathcal{Z} \subset \mathbb{R}^d$ denotes the data space and $\Theta$ the parameter space. Assume that $\ell$ is symmetric, convex, bounded, satisfies the triangle inequality, and for $z=(x,y) \in \mathcal{Z}$ admits the par where $W_1$ is the 1-Wasserstein distance, $\zeta$ is a constant given by $\zeta := B^{-\frac{1}{2}

Figures (18)

  • Figure 1: Disentangled view of targeted instruction selection. First, the query set (stars) and candidate pool (dots) are encoded as data representations. Then, for a given budget, using the data representations for the query and candidates, we perform targeted selection (denoted by the dotted line) using a selection algorithm such as greedy round-robin.
  • Figure 2: Query loss vs. subset-query distance quantile. We stratify candidates into 10 distance quantiles (1 = closest, 10 = farthest) using each representation, select 500 examples per quantile using the RR selection algorithm, and train the Llama 2 7B model. We report query-set cross-entropy loss and Spearman correlation per target task. LESS (RR) exhibits a strong monotonic increase in loss with distance (high positive Spearman correlation), whereas RDS+ (RR) and EMBED (RR) show weak or inconsistent correlations.
  • Figure 3: Downstream performance vs. subset-query distance quantile. Using the same quantile construction and training protocol as Figure 2, we evaluate downstream task performance across distance quantiles and report Spearman correlation per target task. LESS (RR) shows a strong negative correlation across most target tasks, while RDS+ (RR) and EMBED (RR) exhibit weaker, less consistent trends.
  • Figure 4: Query loss vs. budget for different data representations (fixed selection algorithm). Using greedy round-robin selection and the query-candidate pool similarity, we select subsets of size $B\in\{500,1000,2500,5000,10000\}$, train Llama 2 7B on them, and report the mean cross-entropy loss across three seeds, along with the standard error. Random averages over three uniformly sampled subsets from the candidate pool. LESS (RR) achieves the lowest loss across target tasks, while RDS+ (RR) and EMBED (RR) can underperform Random at larger budgets.
  • Figure 5: Downstream performance vs. budget for different data representations (fixed selection algorithm). With the same greedy round-robin selection and budgets as Figure 4, we report the downstream performance for different data representations, averaged across three random seeds, along with the standard error. LESS (RR) performs best on BBH, TyDiQA, and MMLU-Pro, whereas RDS+ (RR) performs best on GSM8K and is competitive with Random on Codex.
  • ...and 13 more figures
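
The quantile analysis described in the captions for Figures 2 and 3 can be sketched as follows. This is an illustrative outline under assumptions, not the authors' implementation: candidates are ranked by a precomputed distance to the query set, split into equal-sized quantiles (1 = closest), and the Spearman correlation is then computed between the quantile index and a per-quantile metric such as query loss. The helper names `distance_quantiles`, `ranks`, and `spearman` are hypothetical, and the training step that produces the per-quantile metric is omitted.

```python
import math


def distance_quantiles(distances, n_quantiles=10):
    """Label each candidate with a quantile in 1..n_quantiles (1 = closest)."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    labels = [0] * len(distances)
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_quantiles // len(distances) + 1
    return labels


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks of x and y.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

In this setup, a representation whose distances are informative would give a positive correlation between quantile index and query loss, and a negative one between quantile index and downstream performance, which is the pattern the captions report for LESS (RR).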

Theorems & Definitions (9)

  • Theorem 6.1
  • Theorem 6.2
  • Lemma 12.1: Domain adaptation bound; adapted from Theorem 2 in redko:ecml17
  • Lemma 12.2: Wasserstein stability bound
  • proof
  • Lemma 12.3: High-probability bound for the Wasserstein distance of an empirical measure
  • proof
  • proof
  • proof