Chasing Random: Instruction Selection Strategies Fail to Generalize

Harshita Diddee, Daphne Ippolito

TL;DR

The findings reveal that data selection can often exceed the cost of fine-tuning on the full dataset, yielding only marginal and sometimes no gains compared to tuning on the full dataset or a random subset.

Abstract

Prior work has shown that language models can be tuned to follow user instructions using only a small set of high-quality instructions. This has accelerated the development of methods that filter large, noisy instruction-tuning datasets down to a high-quality subset that works just as well. However, the performance of these methods is typically not demonstrated across a uniform experimental setup, and thus their generalization capabilities are not well established. In this work, we analyze popular selection strategies across different source datasets, selection budgets, and evaluation benchmarks. Our results indicate that selection strategies generalize poorly, often failing to consistently outperform even random baselines. We also analyze the cost-performance trade-offs of using data selection. Our findings reveal that the cost of data selection can often exceed the cost of fine-tuning on the full dataset, while yielding only marginal and sometimes no gains compared to tuning on the full dataset or a random subset.
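To make the random baseline mentioned in the abstract concrete, the following is a minimal sketch of drawing a random subset under a fixed selection budget. The function name `select_random_subset`, the toy records, and the seed handling are illustrative assumptions and are not taken from the paper.

```python
import random


def select_random_subset(dataset, budget, seed=0):
    """Draw `budget` examples uniformly at random, without replacement.

    Hypothetical helper: the paper's actual sampling code is not shown here.
    """
    if budget > len(dataset):
        raise ValueError("budget exceeds dataset size")
    rng = random.Random(seed)  # local RNG so the draw is reproducible
    return rng.sample(dataset, budget)


# Toy stand-in for an instruction-tuning dataset (made-up records).
toy_dataset = [
    {"instruction": f"task {i}", "response": f"answer {i}"} for i in range(100)
]

subset = select_random_subset(toy_dataset, budget=10, seed=42)
print(len(subset))
```

Because the only work involved is a single sampling call, a baseline of this kind incurs essentially no selection cost, which is the comparison point the abstract refers to.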

Paper Structure

This paper contains 31 sections, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Selection cost versus performance on different benchmarks when selecting 10000 samples from Dolly [DatabricksBlog2023DollyV2]. The upper-left region (low cost, high performance) is ideal. Key takeaways: (a) random baselines are reasonably competitive while incurring the least cost; (b) the best strategy varies significantly with the evaluation metric and setup ($\star$ indicates the best selection strategy on each benchmark).
  • Figure 2: Mean Adjusted Win Rates on AlpacaEval for budgets (a) 1000 and (b) 10000. A bar along the negative y-axis indicates that the $M_{\text{random}}$ responses are preferred more than 50% of the time by GPT-4. No strategy except $S_{\text{deita}}$ beats random baselines consistently, and no strategy shows consistent performance trends across budgets (see §\ref{['sec:random-baselines']} for more details).
  • Figure 3: The performance trends of selection strategies differ starkly depending on which subset of OpenLLM tasks is chosen for evaluation. $S_{\text{random}}$ is the worst-performing strategy across all datasets when performance is gauged on MMLU alone, but it becomes competitive as more tasks from OpenLLM are considered. Details in §\ref{['openllm-results']} and \ref{['fig:openllm-all']}.
  • Figure 4: Mean Instruct-Level Accuracy of $M_{\text{selected}}$ on IFEval versus Win-Rate on AlpacaEval. Win-Rate and IFEval accuracy are weakly correlated at best and in some cases not correlated at all. As the budget increases, the two also appear to diverge: Win-Rate drops as IFEval accuracy improves (see §\ref{['sec:instruction-following-performance']} for further details).
  • Figure 5: Performance on LLMBAR: Both $M_{\text{random}}$ and $M_{\text{selected}}$ consistently underperform $M_{\text{full-dataset}}$.
  • ...and 8 more figures