Efficient RLVR Training via Weighted Mutual Information Data Selection

Xinyu Zhou, Boyu Zhu, Haotian Zhang, Huiming Wang, Zhijiang Guo

TL;DR

The paper introduces InSight, an INformation-guided data SamplInG metHod for RL Training grounded in a weighted mutual information objective. It shows that expected uncertainty reduction decomposes into complementary difficulty- and evidence-dependent components, which reveals a fundamental limitation of difficulty-only selection.

Abstract

Reinforcement learning (RL) plays a central role in improving the reasoning and alignment of large language models, yet its efficiency critically depends on how training data are selected. Existing online selection strategies predominantly rely on difficulty-based heuristics, favouring datapoints with intermediate success rates, implicitly equating difficulty with informativeness and neglecting epistemic uncertainty arising from limited evidence. We introduce InSight, an INformation-guided data SamplInG metHod for RL Training, grounded in a weighted mutual information objective. By modeling data outcomes with Bayesian latent success rates, we show that expected uncertainty reduction decomposes into complementary difficulty- and evidence-dependent components, revealing a fundamental limitation of difficulty-only selection. Leveraging this observation, InSight constructs a stable acquisition score based on the mean belief of datapoints' success rather than noisy sampled outcomes, and naturally extends to multi-rollout settings common in reinforcement learning with verifiable rewards (RLVR). Extensive experiments demonstrate that InSight consistently achieves state-of-the-art performance and improves training efficiency, including a +1.41 average gain on Planning & Mathematics benchmarks, +1.01 improvement on general reasoning, and up to ~2.2x acceleration, with negligible additional computational overhead.
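
To make the belief-tracking idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a Beta belief per datapoint with a uniform Beta(1, 1) prior, a conjugate update from binary rollout rewards, and a top-M selection step that takes a pluggable score function. The names `BetaBelief`, `select_top_m`, and `difficulty_only` are hypothetical. The InSight acquisition score is not reproduced here because this summary does not state it; the example plugs in a difficulty-only heuristic to illustrate the limitation the abstract describes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BetaBelief:
    """Beta(alpha, beta) belief over one datapoint's latent success rate."""
    alpha: float = 1.0  # assumption: uniform prior; the paper's prior may differ
    beta: float = 1.0

    @property
    def mean(self) -> float:
        # Mean belief of success for this datapoint.
        return self.alpha / (self.alpha + self.beta)

    @property
    def evidence(self) -> float:
        # n = alpha + beta, the amount of accumulated evidence.
        return self.alpha + self.beta

    def update(self, rewards: List[int]) -> None:
        """Conjugate update from a batch of binary rollout rewards."""
        successes = sum(rewards)
        self.alpha += successes
        self.beta += len(rewards) - successes


def select_top_m(beliefs: Dict[str, BetaBelief],
                 score_fn: Callable[[BetaBelief], float],
                 m: int) -> List[str]:
    """Return the ids of the m highest-scoring datapoints."""
    ranked = sorted(beliefs, key=lambda t: score_fn(beliefs[t]), reverse=True)
    return ranked[:m]


def difficulty_only(belief: BetaBelief) -> float:
    """Stand-in heuristic: peaks at a mean success rate of 0.5, ignores evidence."""
    return belief.mean * (1.0 - belief.mean)


beliefs = {
    "fresh": BetaBelief(),          # no rollouts observed yet
    "well_observed": BetaBelief(),  # many rollouts, half of them successful
    "easy": BetaBelief(),           # a few rollouts, all successful
}
beliefs["well_observed"].update([1, 0] * 50)
beliefs["easy"].update([1] * 8)

selected = select_top_m(beliefs, difficulty_only, m=2)
print(selected)
print(beliefs["fresh"].evidence, beliefs["well_observed"].evidence)
```

In this toy setup the difficulty-only score assigns the same value to `fresh` and `well_observed`, because both have a mean belief of 0.5, even though one has almost no evidence behind it and the other has a hundred observed rollouts. An evidence-aware score would be passed in as a different `score_fn`.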

Paper Structure

This paper contains 25 sections, 1 theorem, 33 equations, 4 figures, 7 tables, and 1 algorithm.

Key Result

Proposition 5.1

Let $\Phi_{\tau}\sim \text{Beta}(\alpha_{\tau},\beta_{\tau})$ denote the latent success rate of a datapoint $\tau$, and $R\in\{0,1\}$ be a Bernoulli reward conditioned on $\Phi_{\tau}$. Define $n_{\tau}=\alpha_{\tau}+\beta_{\tau}$. Then, as $n_{\tau}\rightarrow\infty$, we have

Figures (4)

  • Figure 1: Overview of the InSight pipeline. InSight maintains a Bayesian belief over data success rates, scores candidate datapoints using Weighted Mutual Information (WMI), and selects the top-$M$ datapoints for RL training. Observed rewards update the posterior beliefs, enabling adaptive data selection that jointly accounts for data difficulty and accumulated evidence, unlike difficulty-only heuristics.
  • Figure 2: Expected variance reduction as a function of prior mean $\bar{\phi}_\tau$ and accumulated evidence $n$. While difficulty-based heuristics focus solely on $\bar{\phi}_\tau \approx 0.5$, the expected uncertainty reduction decays rapidly with evidence, revealing a fundamental limitation of difficulty-only selection.
  • Figure 3: Weighting functions $w(\bar{\phi})$ under different values of $\eta$ and $\mu$.
  • Figure 4: Performance comparisons among different methods on the Countdown task. Our proposed InSight outperforms the existing SOTA (MoPPS) and other baselines in both training efficiency and performance.
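
The evidence dependence described in the Figure 2 caption can be checked numerically with a standard Beta-Bernoulli identity. The sketch below is an illustration under stated assumptions, not the paper's exact quantity: it considers a single Bernoulli reward (the paper also covers multi-rollout settings) and uses hypothetical helper names. For a $\text{Beta}(\alpha,\beta)$ belief with $n=\alpha+\beta$ and mean $\bar{\phi}=\alpha/n$, the law of total variance gives an expected variance reduction of $\bar{\phi}(1-\bar{\phi})/(n+1)^2$, which is a difficulty term multiplied by an evidence term.

```python
def beta_variance(alpha: float, beta: float) -> float:
    """Variance of a Beta(alpha, beta) distribution."""
    n = alpha + beta
    return alpha * beta / (n * n * (n + 1.0))


def expected_variance_reduction(alpha: float, beta: float) -> float:
    """Prior variance minus expected posterior variance after one binary reward."""
    mean = alpha / (alpha + beta)
    # Reward 1 occurs with probability `mean` and yields Beta(alpha + 1, beta);
    # reward 0 occurs with probability 1 - mean and yields Beta(alpha, beta + 1).
    expected_posterior = (mean * beta_variance(alpha + 1.0, beta)
                          + (1.0 - mean) * beta_variance(alpha, beta + 1.0))
    return beta_variance(alpha, beta) - expected_posterior


def closed_form(alpha: float, beta: float) -> float:
    """Difficulty term mean * (1 - mean) times evidence term 1 / (n + 1)^2."""
    n = alpha + beta
    mean = alpha / n
    return mean * (1.0 - mean) / (n + 1.0) ** 2


# Same mean belief (0.5), increasing amounts of evidence n = 2, 20, 200.
for half_n in (1.0, 10.0, 100.0):
    direct = expected_variance_reduction(half_n, half_n)
    print(f"n={2 * half_n:.0f}  direct={direct:.3e}  closed={closed_form(half_n, half_n):.3e}")
```

All three settings have the same difficulty term, yet the expected reduction shrinks roughly as $1/n^2$, so a datapoint at $\bar{\phi}\approx 0.5$ with a long reward history contributes far less than a fresh one.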

Theorems & Definitions (1)

  • Proposition 5.1