Compute-Constrained Data Selection
Junjie Oscar Yin, Alexander M. Rush
TL;DR
The paper introduces compute-constrained data selection for finetuning LLMs, formalizing a budget-aware objective that trades the cost of selecting data against the resulting training gain. Evaluating six data-selection methods across multiple model sizes and three downstream tasks, the study finds that inexpensive methods (e.g., BM25, embedding-based) often dominate in compute-optimal regimes, while more expensive perplexity- and gradient-based approaches become advantageous only when the training compute far exceeds the selection cost. The authors fit a parametric model relating compute to performance and use it to extrapolate compute-optimal training-to-selection model size ratios, which they estimate as $5\times$ for perplexity-based and $10\times$ for gradient-based methods. The results favor cheap data-selection strategies for practical compute-constrained finetuning, and the framework and empirical benchmarks give future work on cheaper data selection a baseline to compare against.
Abstract
Data selection can reduce the amount of training data needed to finetune LLMs; however, the efficacy of data selection scales directly with its compute. Motivated by the practical challenge of compute-constrained finetuning, we consider the setting in which the costs of both selecting data and training are budgeted for. We first formalize the problem of data selection with a cost-aware utility function, and model the data selection problem as trading off initial-selection cost for training gain. We run a comprehensive sweep of experiments across multiple tasks, varying compute budget by scaling finetuning tokens, model sizes, and data selection compute. Interestingly, we find that many powerful data selection methods are almost never compute-optimal, and that cheaper data selection alternatives dominate from both a theoretical and an empirical perspective. For compute-optimal training, we find that perplexity and gradient data selection require training-to-selection model size ratios of 5x and 10x, respectively.
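The trade-off between selection cost and training gain can be illustrated with a small budget calculation. The sketch below is not the paper's cost model. It assumes the common rule of thumb of roughly 6 FLOPs per parameter per token for training and roughly 2 for a forward pass, and it assumes the selector scores every token in the candidate pool once. The function name `trainable_tokens` and all numbers are hypothetical and chosen only for illustration.

```python
# Assumption: ~6 FLOPs per parameter per token for a forward+backward pass,
# ~2 FLOPs per parameter per token for a forward pass only.
TRAIN_FLOPS_PER_PARAM_TOKEN = 6.0
FORWARD_FLOPS_PER_PARAM_TOKEN = 2.0


def trainable_tokens(budget_flops, train_params, pool_tokens,
                     selector_params=0.0, selector_flops_per_param_token=0.0):
    """Tokens the target model can still be trained on after paying for selection.

    Selection cost is modeled (as an assumption) as one scoring pass of the
    selector model over the whole candidate pool.
    """
    selection_cost = selector_flops_per_param_token * selector_params * pool_tokens
    remaining = budget_flops - selection_cost
    if remaining <= 0:
        return 0.0  # selection alone exhausts the budget
    return remaining / (TRAIN_FLOPS_PER_PARAM_TOKEN * train_params)


# Hypothetical setting: 7B-parameter target model, 100M-token pool, 1e18 FLOPs.
budget, target, pool = 1e18, 7e9, 1e8

# Cheap selector (e.g. lexical scoring), treated here as negligible compute.
cheap = trainable_tokens(budget, target, pool)

# Perplexity-style selector the same size as the target model.
same_size = trainable_tokens(budget, target, pool,
                             selector_params=7e9,
                             selector_flops_per_param_token=FORWARD_FLOPS_PER_PARAM_TOKEN)

# Perplexity-style selector 5x smaller than the target model.
smaller = trainable_tokens(budget, target, pool,
                           selector_params=1.4e9,
                           selector_flops_per_param_token=FORWARD_FLOPS_PER_PARAM_TOKEN)

print(cheap, same_size, smaller)
```

In this toy setting, a selector as large as the target model consumes the whole budget before any training happens, while a smaller selector leaves most of the budget for training. Whether the better-selected data then outweighs the lost training tokens is the empirical question the paper studies.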
