Compute-Constrained Data Selection

Junjie Oscar Yin, Alexander M. Rush

TL;DR

The paper introduces compute-constrained data selection for finetuning LLMs, formalizing a budget-aware objective that trades data-selection cost against training gain. By evaluating six data-selection methods across multiple model sizes and three downstream tasks, the study finds that inexpensive methods (e.g., BM25, embedding-based) often dominate in compute-optimal regimes, while more expensive perplexity- and gradient-based approaches become advantageous only when the training compute far exceeds the selection cost. A parametric model is proposed to relate compute to performance, enabling extrapolation of compute-optimal training-to-selector size ratios, which the authors estimate as $5\times$ for perplexity-based and $10\times$ for gradient-based methods. The results favor cheap data-selection strategies in practical compute-constrained finetuning, and the framework and empirical benchmarks are offered as a guide for future research on cheaper data selection. For practitioners, the work clarifies how to split a compute budget between selection and training in instruction-tuning pipelines.

Abstract

Data selection can reduce the amount of training data needed to finetune LLMs; however, the efficacy of data selection scales directly with its compute. Motivated by the practical challenge of compute-constrained finetuning, we consider the setting in which both the cost of selecting data and training are budgeted for. We first formalize the problem of data selection with a cost-aware utility function, and model the data selection problem as trading off initial-selection cost for training gain. We run a comprehensive sweep of experiments across multiple tasks, varying compute budget by scaling finetuning tokens, model sizes, and data selection compute. Interestingly we find that many powerful data selection methods are almost never compute-optimal, and that cheaper data selection alternatives dominate both from a theoretical and empirical perspective. For compute-optimal training, we find that perplexity and gradient data selection require training-to-selection model size ratios of 5x and 10x, respectively.

Paper Structure

This paper contains 44 sections, 10 equations, 17 figures, 9 tables.

Figures (17)

  • Figure 1: Simulation of Performance under Constraints. The behavior of different data selection methods using our performance model, $P(k) = \bar{P} \times \left( 1 - \exp\left( -\lambda \frac{C(k)}{C(|\mathcal{D}|)} \right) \right)$. (Left, Small Budget) The Lexicon method may consistently outperform more advanced data selection methods if their initial cost is too high. Under our assumptions, gradient can never be optimal as its cost exceeds 1 epoch of training. (Middle, Medium Budget) The perplexity method can become optimal once the total cost exceeds a given amount. (Right, Large Budget) The gradient methods can be optimal if training is more expensive than the fixed cost, for example if using a much larger base model than data selection model. The simulation shows that the compute-optimal data selection method changes as a function of the compute budget and the performance rate associated with each method.
  • Figure 2: Performance for Different Data Selection Methods. We show all of our different runs for a given model size, where each scatter point is the final target task performance of a single run. (A, B, C) show MMLU results across three model sizes, while (D, E, F) present BBH results across three model sizes. For each FLOPs budget, we determine the optimal finetuning strategy, a combination of data selection method and number of finetuning tokens, that achieves the highest performance under that budget. We fit a Pareto front (dashed line) to these optimal strategies; it is a straight line in linear-log space. At small and medium compute budgets (A, B, D, E), cheaper data selection methods like BM25 and EMBED outperform PPL and LESS, which rely on model information. At larger compute budgets (C, F), however, PPL and LESS become compute-optimal after using 5% of the fine-tuning tokens.
  • Figure 3: Parametric Fit of Performance with Compute-Constrained Data Selection. We fit the parametric performance model in Equation \ref{eq:parametric-function} and display it as curves alongside the empirical results, shown as scatter points. (A, B, C) show MMLU results and their parametric fit across three model sizes, while (D, E, F) present BBH results and their parametric fit across three model sizes.
  • Figure 4: Multiple Task-Specific Model Break-Even Analysis. The cost of running the gradient-based method (LESS) is spread over all the target tasks. Performance under compute constraints reaches the finetuned Pareto frontier at 10 tasks and surpasses it at 20 tasks.
  • Figure 5: (a) Fixed Training Budget. When only the training budget is considered, sophisticated methods consistently outperform cheaper methods. (b) Performance and Parametric Fit on IFEval. At small compute budgets, sophisticated methods are not compute-optimal.
  • ...and 12 more figures
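
The simulation described in the Figure 1 caption can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code. It assumes that $C(k)$ is the compute left for training after a method's up-front selection cost has been paid, and that budgets are measured in units of one full training pass, $C(|\mathcal{D}|) = 1$. The method names, selection costs, and rates $\lambda$ in `METHODS` are hypothetical values chosen to reproduce the qualitative small/medium/large-budget pattern, not numbers from the paper.

```python
import math


def performance(budget, selection_cost, full_train_cost, p_max, rate):
    """P = p_max * (1 - exp(-rate * C / C_full)).

    Assumption: C is the compute remaining for training once the fixed
    selection cost has been subtracted from the total budget.
    """
    train_compute = max(budget - selection_cost, 0.0)
    return p_max * (1.0 - math.exp(-rate * train_compute / full_train_cost))


# Hypothetical methods: (selection cost in units of one full training pass,
# performance rate lambda). Illustrative values only, not from the paper.
METHODS = {
    "lexicon": (0.0, 2.0),
    "perplexity": (0.3, 4.0),
    "gradient": (1.2, 8.0),
}


def best_method(budget, full_train_cost=1.0, p_max=1.0):
    """Return the method with the highest modeled performance at this budget."""
    scores = {
        name: performance(budget, cost * full_train_cost, full_train_cost, p_max, rate)
        for name, (cost, rate) in METHODS.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Small, medium, and large total budgets (selection plus training).
    for budget in (0.2, 1.0, 3.0):
        name, scores = best_method(budget)
        print(f"budget={budget}: best={name}, scores={scores}")
```

Under these assumed values the cheapest method wins at the small budget, and the costlier methods take over only once the budget is large enough to absorb their selection cost, which mirrors the three panels of the figure. Changing the costs or rates in `METHODS` moves the crossover points.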