Hard Examples Are All You Need: Maximizing GRPO Post-Training Under Annotation Budgets
Benjamin Pikus, Pratyush Ranjan Tiwari, Burton Ye
TL;DR
This work tackles data efficiency in GRPO-based fine-tuning by proposing a budget-aware, offline subset selection framework that uses multi-sample probing to rank examples by difficulty. Across GSM8K and BIG-Bench Hard tracks with multiple model families, training on the hardest $p=10\%$ of prompts yields the largest gains, up to $\sim$47 percentage points, and better out-of-distribution generalization. The authors show that GRPO learning relies on within-group outcome variance, which hard examples preserve for longer during training, and that most of the learning value is concentrated in examples the base model initially gets wrong (the "base wrong" set). These findings offer practical guidance for data collection: to maximize learning efficiency and robustness in budget-constrained RLHF-style fine-tuning, target prompts where the base model struggles.
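The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `select_hardest`, the `probe` callback (assumed to sample several completions from the base model and return one pass/fail flag per completion), and the tie-breaking rule are all hypothetical.

```python
from typing import Callable, List


def select_hardest(
    prompts: List[str],
    probe: Callable[[str], List[bool]],
    fraction: float = 0.10,
) -> List[str]:
    """Return the `fraction` of prompts with the lowest probed success rate.

    `probe` is a hypothetical callback: given a prompt, it returns one
    boolean per sampled base-model completion (True means correct).
    """
    # Empirical success rate per prompt, estimated from multiple samples.
    rates = []
    for prompt in prompts:
        outcomes = probe(prompt)
        rates.append(sum(outcomes) / len(outcomes))

    # Annotation budget: keep at least one prompt.
    budget = max(1, round(fraction * len(prompts)))

    # Lowest success rate first; sorted() is stable, so ties keep input order.
    order = sorted(range(len(prompts)), key=lambda i: rates[i])
    return [prompts[i] for i in order[:budget]]
```

Ranking by indices rather than by prompt text keeps the sketch correct even if the prompt list contains duplicates.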
Abstract
Collecting high-quality training examples for language model fine-tuning is expensive, with practical budgets limiting the amount of data that can be procured. We investigate whether example difficulty affects GRPO training effectiveness by comparing selection strategies (easy, medium, hard, random) across multiple models and reasoning tasks. Training on the hardest 10\% of examples (those where the base model fails most often) yields dramatic performance gains of up to 47\%, while easy examples produce minimal improvements of 3-15\%. This occurs because GRPO requires outcome variance to generate learning signals; hard examples maintain mixed success/failure outcomes throughout training, while easy examples quickly converge to consistent success, eliminating learning opportunities. Moreover, models trained on hard examples show superior out-of-distribution generalization, with only hard-trained models achieving meaningful gains on the AIME2025 benchmark. Our findings provide clear guidance: when budget-constrained, prioritize collecting and annotating examples where your base model struggles, as these drive nearly all learning value in GRPO fine-tuning.
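The variance argument can be made concrete with the standard group-relative advantage used in GRPO, where each sampled completion's reward is normalized by the mean and standard deviation of its group. The sketch below is illustrative only: the function name `group_advantages` and the stabilizer `eps` are assumptions, and binary 0/1 rewards are used for simplicity.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    `eps` is an assumed small constant that avoids division by zero
    when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Easy prompt: every completion succeeds, so all advantages are zero
# and the group contributes no gradient signal.
easy = group_advantages([1.0, 1.0, 1.0, 1.0])

# Hard prompt: mixed outcomes, so the rare success gets a positive
# advantage and the failures get negative ones.
hard = group_advantages([1.0, 0.0, 0.0, 0.0])
```

A group in which every completion fails has the same zero-advantage property as one in which every completion succeeds, so it is mixed outcomes, not failure as such, that produce the learning signal.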
