Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity

Germán Kruszewski, Pierre Erbacher, Jos Rozen, Marc Dymetman

TL;DR

The paper argues that RL-based tuning of LLMs to solve reasoning tasks tends to reduce diversity due to mode-seeking optimization of Reverse KL. It introduces Distributional Matching with Verifiable Rewards (DMVR), which targets an explicit verifier-driven distribution while staying close to the base model, and unifies this with α-DPG to trade off precision and coverage. By analyzing the RLVR-to-DMVR relationship and leveraging α-divergences, the authors demonstrate a Pareto frontier on Lean theorem proving, with intermediate α values yielding substantial gains in coverage without sacrificing accuracy. The framework clarifies why diversity collapses under traditional RL approaches and provides a principled path to preserve breadth of solutions in formal reasoning tasks. Overall, DMVR and α-DPG offer a flexible, theoretically grounded approach to balancing correctness and diversity in verifiable reasoning tasks, with practical implications for scalable, diverse solution discovery.

Abstract

Reinforcement Learning (RL) has become the de facto standard for tuning LLMs to solve tasks involving reasoning. However, growing evidence shows that models trained in this way often suffer from a significant loss in diversity. We argue that this arises because RL implicitly optimizes the "mode-seeking" or "zero-forcing" Reverse KL to a target distribution, causing the model to concentrate mass on certain high-probability regions of the target while neglecting others. In this work, we instead begin from an explicit target distribution, obtained by filtering out incorrect answers while preserving the relative probabilities of correct ones. Starting from a pre-trained LLM, we approximate this target distribution using the $\alpha$-divergence family, which unifies prior approaches and enables direct control of the precision-diversity trade-off by interpolating between mode-seeking and mass-covering divergences. On a Lean theorem-proving benchmark, our method achieves state-of-the-art performance along the coverage-precision Pareto frontier, outperforming all prior methods on the coverage axis.
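The filtering step and the role of $\alpha$ can be illustrated on a toy discrete answer space. The sketch below is not the authors' implementation: the answer names, the probabilities, and the helper functions `filtered_target` and `alpha_divergence` are hypothetical, and the divergence uses one common parametrization, $D_\alpha(\pi \| p) = \frac{1 - \sum_y \pi(y)^\alpha p(y)^{1-\alpha}}{\alpha(1-\alpha)}$, which may differ from the convention used in the paper. Under this assumed parametrization, $\alpha \to 1$ approaches the Reverse KL, $\mathrm{KL}(\pi \| p)$, and $\alpha \to 0$ approaches the Forward KL, $\mathrm{KL}(p \| \pi)$.

```python
def filtered_target(base_probs, is_correct):
    """Zero out incorrect answers and renormalize, so that correct answers
    keep their relative probabilities under the base model."""
    kept = {y: p for y, p in base_probs.items() if is_correct(y)}
    z = sum(kept.values())  # base-model mass on correct answers
    if z == 0:
        raise ValueError("the base model assigns no mass to correct answers")
    return {y: kept.get(y, 0.0) / z for y in base_probs}


def alpha_divergence(pi, p, alpha):
    """Assumed parametrization: (1 - sum_y pi^alpha * p^(1-alpha)) / (alpha * (1-alpha))."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("this sketch only handles 0 < alpha < 1")
    hellinger_sum = sum(pi[y] ** alpha * p[y] ** (1.0 - alpha) for y in p)
    return (1.0 - hellinger_sum) / (alpha * (1.0 - alpha))


# Hypothetical base model over four candidate answers; two pass the verifier.
base = {"proof_a": 0.4, "proof_b": 0.1, "wrong_c": 0.3, "wrong_d": 0.2}
target = filtered_target(base, lambda y: y.startswith("proof"))

# A policy that collapsed onto the single most likely correct answer.
collapsed = {"proof_a": 1.0, "proof_b": 0.0, "wrong_c": 0.0, "wrong_d": 0.0}

for alpha in (0.1, 0.5, 0.9):
    print(alpha,
          alpha_divergence(base, target, alpha),
          alpha_divergence(collapsed, target, alpha))
```

In this toy setting the target keeps the 4:1 ratio between `proof_a` and `proof_b`. At $\alpha = 0.9$ the collapsed policy is scored as closer to the target than the base model, while at $\alpha = 0.1$ the ordering flips and the diverse but imprecise base model is scored as closer, which mirrors the precision-diversity trade-off described in the abstract.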

Paper Structure

This paper contains 46 sections, 6 theorems, 49 equations, 15 figures, 6 tables.

Key Result

Lemma 1

Define Then,

Figures (15)

  • Figure 2: Left: Illustrative representation of our method. GRPO/PPO and other policy-gradient methods used in RLVR focus the model on a small region of the target distribution. Other methods, such as KL-DPG, recover more of the diversity at the cost of putting probability mass on low-quality regions. $\alpha$-DPG makes it possible to strike a balance between the two. Right: Estimates of the models' precision (pass@1) and coverage (pass@256). $\alpha$-DPG models sit along a Pareto frontier.
  • Figure 3: Pass@$k$ curves on the test set for the Base-SFT model tuned with different methods.
  • Figure 4: Problem Difficulty Transition Matrix from Base-SFT to GRPO. The matrix shows the number of problems that transition from an initial difficulty classification under the base model (Base-SFT) (y-axis) to a final classification after post-training (x-axis). The results highlight a polarizing effect: $\alpha$-DPG ($\alpha=0.99$) and GRPO exhibit similar behavior, improving performance on a majority of medium-difficulty problems by making them easy, but also degrading performance on hard problems, causing nearly half of them to become unsolved. $\alpha$-DPG ($\alpha = 0.5$) and GRPO (High-KL) are more conservative: they improve sample efficiency on fewer problems, but harder problems remain solvable.
  • Figure 5: Left: Relationship between premise diversity measured by Shannon index and model performance (pass@1 and pass@256). The quadratic regression lines are computed for $\alpha$-DPG models. The left y-axis shows pass@1 performance, and the right y-axis shows pass@256 performance. Right: Perplexity analysis showing the distribution of perplexity for responses to a single problem sampled from various models under the base SFT model distribution.
  • Figure 6: Training curves for $\alpha$-DPG and dr-GRPO: reward (left) and sequence entropy (right).
  • ...and 10 more figures
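
The captions above report precision as pass@1, coverage as pass@256, and premise diversity as a Shannon index. As a rough sketch of how such quantities are commonly computed, the block below assumes the widely used unbiased pass@$k$ estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct, and the natural-log Shannon index $-\sum_i p_i \ln p_i$ over item frequencies. The source does not state which estimator or logarithm base the authors use, and the function names `pass_at_k` and `shannon_index` are hypothetical.

```python
import math
from collections import Counter


def pass_at_k(n, c, k):
    """Estimated probability that at least one of k samples is correct,
    given that c of n drawn samples were correct (assumed estimator)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def shannon_index(items):
    """Shannon index (natural log) of the empirical frequencies of items,
    for example the premises used across sampled proofs."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((m / total) * math.log(m / total) for m in counts.values())
```

For a single problem, `pass_at_k(n=256, c=3, k=1)` would estimate precision and `pass_at_k(n=256, c=3, k=256)` coverage; averaging over problems gives benchmark-level numbers.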

Theorems & Definitions (11)

  • Lemma 1
  • Lemma 2
  • Proposition 1
  • Lemma 3
  • proof
  • Lemma 4: Connection to Hellinger sum
  • proof
  • Theorem 5: Support Decomposition
  • proof
  • Remark 1: Limit $\alpha \to 1$: The Strong Constraint
  • ...and 1 more