Data-Efficient RLVR via Off-Policy Influence Guidance

Erle Zhu, Dazhi Jiang, Yuan Wang, Xujun Li, Jiale Cheng, Yuxian Gu, Yilin Niu, Aohan Zeng, Jie Tang, Minlie Huang, Hongning Wang

TL;DR

This work tackles data selection for Reinforcement Learning with Verifiable Rewards (RLVR) in large language models by grounding data attribution in influence functions. It avoids the cost of fresh policy rollouts by estimating gradients off-policy from offline trajectories, and it uses sparse random projection to handle high-dimensional gradients; together these yield Practical Off-Policy Influence (POPI). Building on POPI, the authors introduce CROPI, a multi-stage curriculum RL framework that selects the most influential data for each policy checkpoint to accelerate training. Experiments across 1.5B–7B models show that CROPI improves both step and data efficiency, including a 2.66x step-level speedup on a 1.5B model using only 10% of the data per phase, and that it improves generalization to untargeted tasks. The approach provides a theoretically grounded, scalable alternative to heuristic data selection for RLVR in large reasoning models.

Abstract

Data selection is a critical aspect of Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of large language models (LLMs). Current data selection methods are largely heuristic-based, lacking theoretical guarantees and generalizability. This work proposes a theoretically grounded approach using influence functions to estimate the contribution of each data point to the learning objective. To overcome the prohibitive computational cost of policy rollouts required for online influence estimation, we introduce an off-policy influence estimation method that efficiently approximates data influence using pre-collected offline trajectories. Furthermore, to manage the high-dimensional gradients of LLMs, we employ sparse random projection to reduce dimensionality and improve storage and computation efficiency. Leveraging these techniques, we develop Curriculum RL with Off-Policy Influence guidance (CROPI), a multi-stage RL framework that iteratively selects the most influential data for the current policy. Experiments on models up to 7B parameters demonstrate that CROPI significantly accelerates training. On a 1.5B model, it achieves a 2.66x step-level acceleration while using only 10% of the data per stage compared to full-dataset training. Our results highlight the substantial potential of influence-based data selection for efficient RLVR.
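
The selection idea described in the abstract can be illustrated with a toy sketch. The code below is not the authors' implementation: it assumes an Achlioptas-style sparse random projection, assumes that influence is scored as the cosine similarity between a projected training gradient and a projected validation gradient, and uses hypothetical function names (`sparse_projection_matrix`, `project`, `select_top_fraction`). Gradients are plain Python lists standing in for per-prompt policy gradients estimated from offline trajectories.

```python
import math
import random


def sparse_projection_matrix(dim, proj_dim, s=3.0, seed=0):
    """Build a sparse random matrix with proj_dim rows and dim columns.

    Each entry is +sqrt(s/proj_dim) with probability 1/(2s),
    -sqrt(s/proj_dim) with probability 1/(2s), and 0 otherwise,
    so squared norms are preserved in expectation.
    Rows are stored as {column_index: weight} dicts to keep them sparse.
    """
    rng = random.Random(seed)
    scale = math.sqrt(s / proj_dim)
    rows = []
    for _ in range(proj_dim):
        row = {}
        for j in range(dim):
            u = rng.random()
            if u < 1.0 / (2.0 * s):
                row[j] = scale
            elif u < 1.0 / s:
                row[j] = -scale
        rows.append(row)
    return rows


def project(rows, grad):
    """Project a dense gradient vector into the low-dimensional space."""
    return [sum(w * grad[j] for j, w in row.items()) for row in rows]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else 0.0


def select_top_fraction(train_grads, val_grad, rows, fraction=0.1):
    """Score each training gradient against the validation gradient
    (assumed scoring rule: cosine similarity in the projected space)
    and return the indices of the top `fraction`, plus all scores."""
    val_p = project(rows, val_grad)
    scores = [cosine(project(rows, g), val_p) for g in train_grads]
    k = max(1, int(len(train_grads) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores
```

In a multi-stage curriculum of the kind the abstract describes, a routine like `select_top_fraction` would be rerun at each policy checkpoint with freshly estimated gradients, so the selected subset tracks the current policy. With `fraction=0.1` it keeps 10% of the candidates per stage, matching the per-stage budget quoted above.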

Paper Structure

This paper contains 37 sections, 20 equations, 18 figures, 5 tables, 1 algorithm.

Figures (18)

  • Figure 1: Practical issues in computing the influence of a data point during RL training of large-scale models.
  • Figure 2: The schematic of our proposed framework Curriculum RL with Off-Policy Influence Guidance (CROPI).
  • Figure 3: Training curves in the 1.5B setting. CROPI surpasses all other baselines and achieves a significant step-level acceleration ratio of $\bm{2.66\times}$ compared to full-data training while using only 10% of the data during each phase.
  • Figure 4: Rank preservation experiments for Sparse Random Projection.
  • Figure 5: Semantic similarity between top-100 and bottom-100 training prompts selected by POPI and the validation set.
  • ...and 13 more figures