Hard Examples Are All You Need: Maximizing GRPO Post-Training Under Annotation Budgets

Benjamin Pikus, Pratyush Ranjan Tiwari, Burton Ye

TL;DR

This work tackles data efficiency in GRPO-based fine-tuning by proposing a budget-aware, offline subset selection framework that uses multi-sample probing to rank example difficulty. Across GSM8K and BIG-Bench Hard tracks with multiple model families, training on the hardest $p=10\%$ of prompts yields the largest gains, up to $\sim$47 percentage points, along with better out-of-distribution generalization. The authors show that GRPO learning relies on within-group variance, which is preserved longer in hard examples, and that most of the learning value is concentrated in examples the base model initially fails to solve (the "base wrong" set). These findings offer practical guidance for data collection: to maximize learning efficiency and robustness in budget-constrained RLHF-style fine-tuning, target prompts where the base model struggles.
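The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes each prompt has already been probed several times with the base model and that each sample has been judged correct or incorrect. The names `success_rate` and `select_hardest`, the tie-breaking rule, and the toy data are all hypothetical.

```python
from typing import List, Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of probe samples for one prompt that were judged correct."""
    if not outcomes:
        raise ValueError("need at least one probe sample per prompt")
    return sum(outcomes) / len(outcomes)


def select_hardest(
    probe_outcomes: Sequence[Sequence[bool]], fraction: float = 0.10
) -> List[int]:
    """Return indices of the hardest `fraction` of prompts.

    Difficulty is approximated by the base model's empirical success rate
    over its probe samples: a lower rate means a harder prompt.
    Ties are broken by original index (an assumption made for determinism).
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rates = [success_rate(o) for o in probe_outcomes]
    # Always keep at least one prompt, even for very small pools.
    budget = max(1, round(fraction * len(rates)))
    order = sorted(range(len(rates)), key=lambda i: (rates[i], i))
    return order[:budget]


# Toy probe results: 5 prompts, 4 base-model samples each (illustrative only).
probe_outcomes = [
    [True, True, True, True],      # success rate 1.00 (easy)
    [False, False, True, False],   # success rate 0.25
    [True, False, True, True],     # success rate 0.75
    [False, False, False, False],  # success rate 0.00 (hardest)
    [True, True, False, False],    # success rate 0.50
]
hardest = select_hardest(probe_outcomes, fraction=0.4)
```

With the toy data, a 40% budget keeps the two prompts with the lowest success rates. In practice the fraction would be set by the annotation budget, such as the 10% used in the experiments.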

Abstract

Collecting high-quality training examples for language model fine-tuning is expensive, with practical budgets limiting the amount of data that can be procured. We investigate whether example difficulty affects GRPO training effectiveness by comparing selection strategies (easy, medium, hard, random) across multiple models and reasoning tasks. Training on the hardest 10\% of examples (those where the base model fails most often) yields dramatic performance gains of up to 47\%, while easy examples produce minimal improvements of 3-15\%. This occurs because GRPO requires outcome variance to generate learning signals; hard examples maintain mixed success/failure outcomes throughout training, while easy examples quickly converge to consistent success, eliminating learning opportunities. Moreover, models trained on hard examples show superior out-of-distribution generalization, with only hard-trained models achieving meaningful gains on the AIME2025 benchmark. Our findings provide clear guidance: when budget-constrained, prioritize collecting and annotating examples where your base model struggles, as these drive nearly all learning value in GRPO fine-tuning.
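To see why outcome variance matters, consider the group-normalized advantage commonly used in GRPO, $A_i = (r_i - \mu) / (\sigma + \epsilon)$, where $\mu$ and $\sigma$ are the mean and standard deviation of the rewards within one prompt's sample group. The sketch below assumes this standard formulation and binary correctness rewards; the paper's own equations are not reproduced in this excerpt, and the function name, the `eps` value, and the toy reward groups are illustrative.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Group-normalized advantages for one prompt's sampled completions.

    Assumed formulation: A_i = (r_i - mean) / (std + eps), computed within
    the group. `eps` only guards against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Easy prompt: every sample is correct, so there is no within-group variance.
easy_group = [1.0, 1.0, 1.0, 1.0]
# Hard prompt: mixed success and failure.
hard_group = [1.0, 0.0, 0.0, 1.0]

easy_adv = group_advantages(easy_group)  # all zeros: no gradient signal
hard_adv = group_advantages(hard_group)  # about +1 for successes, -1 for failures
```

When all samples in a group receive the same reward, every advantage is zero and that prompt contributes nothing to the policy update. This matches the abstract's point that easy examples stop providing learning signal once they converge to consistent success.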

Paper Structure

This paper contains 25 sections, 3 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Schematic overview of our experimental protocol
  • Figure 2: GRPO training dynamics reveal early and persistent advantages of hard example selection. Each subplot shows test accuracy over 1000 training steps for models trained on different difficulty-based subsets (10% of full data each). The hardest subset establishes superiority by step 300 and maintains this advantage.
  • Figure 3: Scatter plot showing the relationship between the percentage of learnable training examples and the resulting absolute improvement in model performance, across strategies and models. Colors indicate the strategy and marker shapes indicate the model. We see a strong positive correlation ($R^2=0.66$), indicating that performance improves with more learnable examples.