Market-Driven Subset Selection for Budgeted Training
Ashish Jha, Valentin Leplat, AH Phan
TL;DR
This work introduces a market-based framework for data subset selection under fixed budget constraints, leveraging the Logarithmic Market Scoring Rule (LMSR) to coherently aggregate heterogeneous utility signals into prices over training examples. By modeling each example as a tradeable contract and incorporating token-aware and topic-separable pricing, the method achieves principled maximum-entropy aggregation with theoretical utility-recovery guarantees under noisy signals. Empirically, the approach matches or surpasses strong single-signal baselines on GSM8K under a 60k-token budget and delivers competitive, stable performance on AGNews across multiple retention rates, all with modest computational overhead (<0.1 GPU-hours for selection). The framework offers a practical, scalable path for multi-signal data curation in instruction tuning and prompt-level reasoning tasks, with extensions toward learned signals and multimodal data.
Abstract
Training large language models on massive datasets is computationally expensive, yet empirical evidence suggests that substantial portions of training examples contribute minimally to final performance. Data subset selection addresses this inefficiency by identifying small, high-utility subsets under resource constraints. However, example utility is inherently multi-faceted, encompassing uncertainty, distributional rarity, and diversity signals that are heterogeneous and typically combined through ad hoc weighted sums lacking theoretical grounding. We propose a market-based framework that treats each training example as a tradeable contract and employs the Logarithmic Market Scoring Rule to aggregate multiple utility signals into coherent prices. Heterogeneous signals act as traders, a single liquidity parameter controls concentration versus smoothing, and topic-wise normalization ensures calibrated aggregation. Token budgets are handled explicitly through a price-per-token decision rule with an interpretable length-bias parameter. We establish theoretical connections to maximum-entropy aggregation and provide utility recovery guarantees under noisy but monotone signals. On GSM8K mathematical reasoning under strict 60k-token budgets, our selector achieves parity with strong single-signal baselines while exhibiting lower variance and incurring less than 0.1 GPU-hour overhead. On AGNews classification at 5-25% retention rates, the market formulation delivers competitive accuracy with improved stability. Our framework unifies multi-signal data curation under fixed computational budgets for prompt-level reasoning and classification tasks.
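To make the mechanism concrete, the following is a minimal sketch of LMSR-style price aggregation followed by a price-per-token selection rule. It is not the authors' implementation. The abstract does not give the exact formulas, so several pieces here are assumptions: each signal is z-scored and added to a share vector q, prices are the standard LMSR prices p_i = exp(q_i / b) / sum_j exp(q_j / b) with liquidity b, and the decision score is p_i / length_i^gamma with gamma standing in for the length-bias parameter. The function names `lmsr_prices` and `select_under_budget`, the greedy fill, and the toy numbers are hypothetical. Topic-wise normalization is omitted for brevity.

```python
import math


def lmsr_prices(signals, liquidity=1.0):
    """Aggregate per-example signals into LMSR prices (assumed form).

    signals: list of score lists, one list per signal, each of length n.
    liquidity: the LMSR parameter b; larger values smooth prices toward uniform.
    """
    n = len(signals[0])
    q = [0.0] * n  # outstanding shares per example
    for s in signals:
        # Assumption: each signal "trades" by adding its z-scored values to q.
        mean = sum(s) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in s) / n) or 1.0
        for i, x in enumerate(s):
            q[i] += (x - mean) / std
    # LMSR price = softmax(q / b); subtract the max for numerical stability.
    m = max(q)
    w = [math.exp((qi - m) / liquidity) for qi in q]
    z = sum(w)
    return [wi / z for wi in w]


def select_under_budget(prices, lengths, budget, gamma=1.0):
    """Greedily pick examples by price per token until the token budget is full.

    gamma is a hypothetical length-bias exponent: 0 ignores length,
    1 ranks by price per token.
    """
    order = sorted(
        range(len(prices)),
        key=lambda i: prices[i] / lengths[i] ** gamma,
        reverse=True,
    )
    chosen, used = [], 0
    for i in order:
        if used + lengths[i] <= budget:
            chosen.append(i)
            used += lengths[i]
    return chosen, used


# Toy data (made up for illustration): two signals over four examples.
uncertainty = [0.9, 0.1, 0.5, 0.7]
rarity = [0.2, 0.3, 0.9, 0.6]
lengths = [120, 40, 200, 80]  # token counts per example

prices = lmsr_prices([uncertainty, rarity], liquidity=1.0)
chosen, used = select_under_budget(prices, lengths, budget=250, gamma=1.0)
```

In this sketch a small liquidity value concentrates price mass on the few examples the signals agree on, while a large value flattens prices toward a uniform distribution, which mirrors the "concentration versus smoothing" role the abstract assigns to the liquidity parameter.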
