Market-Driven Subset Selection for Budgeted Training

Ashish Jha, Valentin Leplat, AH Phan

TL;DR

This work introduces a market-based framework for data subset selection under fixed budget constraints, leveraging the Logarithmic Market Scoring Rule (LMSR) to coherently aggregate heterogeneous utility signals into prices over training examples. By modeling each example as a tradeable contract and incorporating token-aware and topic-separable pricing, the method achieves principled maximum-entropy aggregation with theoretical utility-recovery guarantees under noisy signals. Empirically, the approach matches or surpasses strong single-signal baselines on GSM8K under a 60k-token budget and delivers competitive, stable performance on AGNews across multiple retention rates, all with modest computational overhead (<0.1 GPU-hours for selection). The framework offers a practical, scalable path for multi-signal data curation in instruction tuning and prompt-level reasoning tasks, with extensions toward learned signals and multimodal data.

Abstract

Training large language models on massive datasets is computationally expensive, yet empirical evidence suggests that substantial portions of training examples contribute minimally to final performance. Data subset selection addresses this inefficiency by identifying small, high-utility subsets under resource constraints. However, example utility is inherently multi-faceted, encompassing uncertainty, distributional rarity, and diversity signals that are heterogeneous and typically combined through ad hoc weighted sums lacking theoretical grounding. We propose a market-based framework that treats each training example as a tradeable contract and employs the Logarithmic Market Scoring Rule to aggregate multiple utility signals into coherent prices. Heterogeneous signals act as traders, a single liquidity parameter controls concentration versus smoothing, and topic-wise normalization ensures calibrated aggregation. Token budgets are handled explicitly through a price-per-token decision rule with an interpretable length-bias parameter. We establish theoretical connections to maximum-entropy aggregation and provide utility recovery guarantees under noisy but monotone signals. On GSM8K mathematical reasoning under strict 60k-token budgets, our selector achieves parity with strong single-signal baselines while exhibiting lower variance and incurring less than 0.1 GPU-hour overhead. On AGNews classification at 5-25% retention rates, the market formulation delivers competitive accuracy with improved stability. Our framework unifies multi-signal data curation under fixed computational budgets for prompt-level reasoning and classification tasks.
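
As a rough illustration of the pipeline the abstract describes, the sketch below aggregates signals into LMSR-style prices and then fills a token budget greedily. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the linear aggregation `weights @ signals`, the score `price / tokens**gamma` used as the price-per-token rule, and the greedy fill are all hypothetical choices made for illustration. Topic-wise normalization is omitted.

```python
import numpy as np

def lmsr_prices(signals, weights, beta):
    """Aggregate signals into prices via softmax(q / beta).

    signals: array of shape (M, N), one row per signal (assumed pre-normalized).
    weights: array of shape (M,), one weight per signal (hypothetical).
    beta:    liquidity parameter; larger values give smoother prices.
    """
    q = np.asarray(weights, dtype=float) @ np.asarray(signals, dtype=float)
    z = q / beta
    z = z - z.max()              # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def select_under_token_budget(prices, token_counts, budget, gamma=1.0):
    """Greedy fill by price per token (assumed form: price / tokens**gamma).

    gamma plays the role of a length-bias parameter: gamma=0 ignores length,
    gamma=1 ranks by plain price per token.
    """
    token_counts = np.asarray(token_counts)
    score = prices / np.power(token_counts.astype(float), gamma)
    chosen, used = [], 0
    for i in np.argsort(-score):             # highest score first
        if used + token_counts[i] <= budget:  # skip examples that do not fit
            chosen.append(int(i))
            used += int(token_counts[i])
    return chosen, used

# Toy usage: two signals over three examples, 40-token budget.
signals = np.array([[1.0, 0.0, 2.0],
                    [0.5, 0.5, 0.5]])
prices = lmsr_prices(signals, weights=[1.0, 1.0], beta=1.0)
chosen, used = select_under_token_budget(prices, [10, 20, 30], budget=40)
print(chosen, used)
```

With `gamma=1` a short example with a moderate price can outrank a long example with a higher price, which is the trade-off a length-bias parameter is meant to expose.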

Paper Structure

This paper contains 35 sections, 4 theorems, 20 equations, 3 figures, 3 tables, 1 algorithm.

Key Result

Proposition 4.1

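Let $s^{(m)}_i$ denote signal $m$ evaluated on example $i$. Consider the optimization problem over probability vectors $p$:

$$\max_{p} \; H(p) \quad \text{subject to} \quad \sum_i p_i \, s^{(m)}_i = \bar{s}^{(m)} \ \text{ for all } m, \qquad \sum_i p_i = 1, \quad p_i \ge 0,$$

where $H(p) = -\sum_i p_i \log p_i$ is the Shannon entropy and $\bar{s}^{(m)}$ are target moments. The unique solution has the exponential form:

$$p_i = \frac{\exp\left(\sum_m \lambda_m s^{(m)}_i\right)}{\sum_j \exp\left(\sum_m \lambda_m s^{(m)}_j\right)},$$

where $\lambda_m$ are Lagrange multipliers. Setting $\lambda_m = w_m/\beta$ recovers the LMSR prices.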
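
The exponential form follows from a standard Lagrangian argument. The sketch below assumes the constraints are the moment-matching conditions $\sum_i p_i s^{(m)}_i = \bar{s}^{(m)}$ together with normalization $\sum_i p_i = 1$; the multiplier $\nu$ for normalization is notation introduced here, not taken from the paper.

```latex
\begin{align*}
\mathcal{L}(p, \lambda, \nu)
  &= -\sum_i p_i \log p_i
     + \sum_m \lambda_m \Big( \sum_i p_i \, s^{(m)}_i - \bar{s}^{(m)} \Big)
     + \nu \Big( \sum_i p_i - 1 \Big), \\
\frac{\partial \mathcal{L}}{\partial p_i}
  &= -\log p_i - 1 + \sum_m \lambda_m s^{(m)}_i + \nu = 0
  \;\;\Longrightarrow\;\;
  p_i = e^{\nu - 1} \exp\Big( \sum_m \lambda_m s^{(m)}_i \Big), \\
\sum_i p_i = 1
  &\;\;\Longrightarrow\;\;
  p_i = \frac{\exp\big( \sum_m \lambda_m s^{(m)}_i \big)}
             {\sum_j \exp\big( \sum_m \lambda_m s^{(m)}_j \big)}.
\end{align*}
```

Because $H$ is strictly concave and the constraints are linear, this stationary point is the unique maximizer whenever the constraints are feasible. Substituting $\lambda_m = w_m/\beta$ gives $p = \mathrm{softmax}(q/\beta)$ with $q_i = \sum_m w_m s^{(m)}_i$.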

Figures (3)

  • Figure 1: End-to-end market-based data selection pipeline: heterogeneous signals $\to$ LMSR pricing $\to$ token-aware selection $\to$ training and evaluation.
  • Figure 2: LMSR mechanics: (left) cost $C_\beta(q)$; (right) implied prices $p(q)=\mathrm{softmax}(q/\beta)$. Larger $\beta$ flattens curvature and smooths prices.
  • Figure 3: LMSR-based aggregation of heterogeneous signals. Liquidity $\beta$ controls smoothing of prices, interpolating between concentrated and diffuse mixes.
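
The behaviour summarized in the Figure 2 and Figure 3 captions can be reproduced numerically. The sketch below assumes the standard LMSR cost $C_\beta(q) = \beta \log \sum_i \exp(q_i/\beta)$, whose gradient is $\mathrm{softmax}(q/\beta)$; the helper names and the toy vector `q` are illustrative only.

```python
import numpy as np

def lmsr_cost(q, beta):
    """Standard LMSR cost: beta * log(sum_i exp(q_i / beta)), computed stably."""
    z = np.asarray(q, dtype=float) / beta
    m = z.max()
    return beta * (m + np.log(np.exp(z - m).sum()))

def lmsr_price(q, beta):
    """Implied prices: the gradient of the cost, i.e. softmax(q / beta)."""
    z = np.asarray(q, dtype=float) / beta
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Toy outstanding-share vector (illustrative values).
q = np.array([2.0, 1.0, 0.0, -1.0])
for beta in (0.25, 1.0, 4.0):
    p = lmsr_price(q, beta)
    print(f"beta={beta}: prices={np.round(p, 3)}, entropy={entropy(p):.3f}")
```

Small $\beta$ concentrates almost all price mass on the top-scoring example, while large $\beta$ pushes prices toward uniform, so the entropy printed for each setting increases with $\beta$.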

Theorems & Definitions (4)

  • Proposition 4.1: Maximum-Entropy Aggregation
  • Proposition 4.2: Utility Recovery with Weak Signals
  • Proposition 4.3: Coverage with Diversity Signal
  • Lemma 4.4: Bounded Influence Under Corruption