GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning

Ningyuan Yang, Weihua Du, Weiwei Sun, Sean Welleck, Yiming Yang

TL;DR

GradAlign is a gradient-aligned data selection method for LLM reinforcement learning. It uses a small, trusted validation set to prioritize training problems whose policy gradients align with validation gradients, yielding an adaptive curriculum.

Abstract

Reinforcement learning (RL) has become a central post-training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems. This sensitivity stems from the non-stationarity of RL: rollouts are generated by an evolving policy, and learning is shaped by exploration and reward feedback, unlike supervised fine-tuning (SFT) with fixed trajectories. As a result, prior work often relies on manual curation or simple heuristic filters (e.g., accuracy), which can admit incorrect or low-utility problems. We propose GradAlign, a gradient-aligned data selection method for LLM reinforcement learning that uses a small, trusted validation set to prioritize training problems whose policy gradients align with validation gradients, yielding an adaptive curriculum. We evaluate GradAlign across three challenging data regimes: unreliable reward signals, distribution imbalance, and a low-utility training corpus. GradAlign consistently outperforms existing baselines, yielding more stable training and improved final performance, which underscores the importance of directional gradient signals in navigating non-stationary policy optimization. We release our implementation at https://github.com/StigLidu/GradAlign
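
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the released implementation: the function names (`select_aligned`, `mean_vector`), the use of a mean validation gradient as the target direction, the cosine score, and the toy two-dimensional "gradients" are all assumptions made for illustration. In practice the gradients would come from policy-gradient computations on the current model.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    # Cosine similarity; defined as 0.0 when either vector is all zeros.
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot(u, v) / (nu * nv)


def mean_vector(vectors):
    # Element-wise mean of equally sized vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select_aligned(train_grads, val_grads, keep_fraction=0.5):
    """Rank training problems by alignment with the validation direction.

    train_grads: dict mapping a problem id to its (hypothetical) gradient vector.
    val_grads: list of validation gradient vectors.
    Returns the ids of the top-ranked fraction and all scores.
    """
    target = mean_vector(val_grads)  # assumed target direction
    scores = {pid: cosine(g, target) for pid, g in train_grads.items()}
    k = max(1, int(len(scores) * keep_fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy data: two validation gradients and four candidate problems.
val_grads = [[1.0, 0.0], [0.8, 0.2]]
train_grads = {
    "aligned": [2.0, 0.1],
    "orthogonal": [0.0, 1.0],
    "opposed": [-1.0, 0.0],
    "zero": [0.0, 0.0],
}
selected, scores = select_aligned(train_grads, val_grads, keep_fraction=0.5)
print(selected)
```

Because the policy changes during training, a procedure like this would presumably be re-run periodically so that the selected subset tracks the current policy, which is what makes the curriculum adaptive.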

Paper Structure

This paper contains 37 sections, 2 theorems, 9 equations, 4 figures, 8 tables, 1 algorithm.

Key Result

Theorem 4.1

Under on-policy sampling, binary rewards, unbiased advantage estimation without normalization, and ignoring KL regularization and clipping, the GRPO gradient is an unbiased estimator of the gradient of expected accuracy.
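
The paper's proof is not reproduced on this page. The following is a sketch of the standard policy-gradient argument that a statement of this kind typically rests on. The notation is assumed for illustration: $q$ is a problem, $y$ a sampled response, $r(y) \in \{0, 1\}$ the binary reward, and $b$ a baseline that does not depend on the sampled $y$. With on-policy sampling and with clipping and KL terms ignored, the per-sample gradient term reduces to an advantage times $\nabla_\theta \log \pi_\theta(y \mid q)$.

```latex
\begin{align}
J(\theta) &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid q)}\bigl[r(y)\bigr]
  && \text{(expected accuracy)} \\
\nabla_\theta J(\theta) &= \sum_y r(y)\, \nabla_\theta \pi_\theta(y \mid q)
  = \mathbb{E}_{y \sim \pi_\theta}\bigl[r(y)\, \nabla_\theta \log \pi_\theta(y \mid q)\bigr] \\
\mathbb{E}_{y \sim \pi_\theta}\bigl[b\, \nabla_\theta \log \pi_\theta(y \mid q)\bigr]
  &= b\, \nabla_\theta \sum_y \pi_\theta(y \mid q) = b\, \nabla_\theta 1 = 0 \\
\mathbb{E}_{y \sim \pi_\theta}\bigl[(r(y) - b)\, \nabla_\theta \log \pi_\theta(y \mid q)\bigr]
  &= \nabla_\theta J(\theta)
\end{align}
```

Under these assumptions, an unnormalized advantage of the form $r(y) - b$ leaves the expected gradient equal to $\nabla_\theta J(\theta)$. How the paper constructs its unbiased advantage estimate is not stated here, so the choice of $b$ above should be read as a placeholder.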

Figures (4)

  • Figure 1: Overview of GradAlign. GradAlign uses a small validation set to estimate a coherent target gradient direction and scores large-scale training candidates by gradient alignment, selecting the top-ranked fraction to form an adaptive RL online-learning curriculum.
  • Figure 2: Illustration of three challenging data-selection scenarios. Each panel shows a failure mode where accuracy-based filtering fails to identify training samples that improve downstream performance.
  • Figure 3: Training Accuracy Curve on AIME2425, AMC22 and AMC23 (Scenario 1). GradAlign (ours) achieves the strongest performance.
  • Figure 4: Distribution of cosine similarity and inner product similarity. Cosine similarity is a better indicator of corrupted instances than the inner product.
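
One general difference between the two scores in Figure 4 is that the inner product scales with gradient norm, while cosine similarity depends only on direction. The toy example below illustrates that property. The vectors and the labels `clean_small` and `noisy_large` are invented for illustration, and whether this is the mechanism behind the paper's observation is an assumption, not something this page confirms.

```python
import math


def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    # Inner product divided by the product of norms (0.0 for zero vectors).
    denom = math.sqrt(inner(u, u)) * math.sqrt(inner(v, v))
    return inner(u, v) / denom if denom > 0.0 else 0.0


val_grad = [1.0, 0.0]      # hypothetical validation direction
clean_small = [0.1, 0.0]   # well aligned, small norm
noisy_large = [1.0, 10.0]  # weakly aligned, large norm

for name, g in [("clean_small", clean_small), ("noisy_large", noisy_large)]:
    print(name, "inner =", inner(g, val_grad), "cosine =", round(cosine(g, val_grad), 3))
```

Here the inner product ranks the large, weakly aligned gradient above the small, well-aligned one, while cosine similarity reverses that order.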

Theorems & Definitions (4)

  • Theorem 4.1
  • proof
  • Theorem 4.2
  • proof