DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning

Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang, Xin Wang, Hui Shen, Jing Xiong, Chaofan Tao, Zixuan Zhong, Peizhou Huang, Mi Zhang

TL;DR

DSDR is a Dual-Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and coupling components. Experiments demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual-scale diversity for deep exploration in RLVR.

Abstract

Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity, leading to weak and unstable learning signals in group-based policy optimization. We propose DSDR, a Dual-Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and coupling components. Globally, DSDR promotes diversity among correct reasoning trajectories to explore distinct solution modes. Locally, it applies a length-invariant, token-level entropy regularization restricted to correct trajectories, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global-to-local allocation mechanism that emphasizes local regularization for more distinctive correct trajectories. We provide theoretical support showing that DSDR preserves optimal correctness under bounded regularization, sustains informative learning signals in group-based optimization, and yields a principled global-to-local coupling rule. Experiments on multiple reasoning benchmarks demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual-scale diversity for deep exploration in RLVR. Code is available at https://github.com/SUSTechBruce/DSDR.
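
The global-to-local allocation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (the linked repository is the reference for that). The function names, the temperature `tau`, the coefficient `lam`, and the choice of a plain softmax over the diversity scores of correct rollouts are assumptions made for illustration only.

```python
import math


def softmax_coupling_weights(rewards, diversity, tau=1.0):
    """Softmax over diversity scores, restricted to correct rollouts.

    rewards:   list of 0/1 verifier rewards, one per rollout (hypothetical input)
    diversity: list of bounded global diversity scores, one per rollout
    Incorrect rollouts receive weight 0. If no rollout is correct,
    every weight is 0.
    """
    correct = [i for i, r in enumerate(rewards) if r == 1]
    weights = [0.0] * len(rewards)
    if not correct:
        return weights
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(diversity[i] for i in correct)
    exps = {i: math.exp((diversity[i] - top) / tau) for i in correct}
    total = sum(exps.values())
    for i in correct:
        weights[i] = exps[i] / total
    return weights


def mean_token_entropy(token_dists):
    """Average per-token Shannon entropy in nats.

    Averaging over tokens (instead of summing) keeps the value
    independent of the rollout length.
    """
    if not token_dists:
        return 0.0
    total = 0.0
    for dist in token_dists:
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_dists)


def local_entropy_bonus(rewards, diversity, rollout_token_dists, lam=0.01, tau=1.0):
    """Per-rollout entropy bonus: lam * coupling weight * mean token entropy."""
    weights = softmax_coupling_weights(rewards, diversity, tau)
    return [
        lam * w * mean_token_entropy(dists)
        for w, dists in zip(weights, rollout_token_dists)
    ]
```

In this sketch, an incorrect rollout never receives an entropy bonus, and among correct rollouts the one with the higher diversity score receives the larger share.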

Paper Structure (40 sections, 6 theorems, 89 equations, 7 figures, 3 tables)

Key Result

Theorem 3.1

Fix a prompt $q$. Let $\bar{d}(q,o)\in[0,\sigma_d]$ denote a bounded (clipped) global diversity score for a completed rollout $o$ (e.g., as used in the augmented-reward equation), and let $R(q,o)\in\{0,1\}$ be the verifiable reward. For $\tau>0$, define the correct-only diversity-tilted objective $J_\tau$ [equation omitted]. Assume $Z_\tau(\theta;q)>0$. Then the policy gradient of $J_\tau$ admits the form [equation omitted], where the diversity-tilted advantag…

Figures (7)

  • Figure 1: (Left): global-to-local coupling for enhanced exploration during RL training. (Right): baseline exploration collapses to local suboptimal solutions, while DSDR promotes diverse trajectories that escape local optima and reach the correct solution space.
  • Figure 2: DSDR training pipeline for dual-scale exploration in RL. Correct-only global diversity promotes exploration across solution modes, while a global-to-local coupling mechanism allocates length-invariant local entropy regularization to distinctive correct trajectories. Both signals are integrated into policy updates to enable deep exploration without sacrificing correctness.
  • Figure 3: Pass@k performance across five benchmarks for both Qwen3-1.7B and Qwen3-4B. The Base models serve as backbones. DSDR consistently outperforms both the Base models and DAPO across all values of $k$.
  • Figure 4: Training dynamics across methods on the Qwen3-1.7B model. From left to right, we report AIME2024 Avg@16, policy entropy, semantic-level diversity similarity, and formula-level diversity similarity. Results are shown for GRPO, DSDR, DSDR w/o GD, DSDR w/o GC, and DAPO.
  • Figure 5: We generate 32 test-time rollouts per problem on four benchmarks and evaluate response diversity using an LLM-as-a-Judge (1–10 scale). The figure reports diversity scores and corresponding pass@32 for DAPO and DSDR.
  • ...and 2 more figures
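
The pass@k numbers in Figures 3 and 5 are computed from multiple rollouts per problem. The summary does not state which estimator the authors use, so the sketch below assumes the widely used unbiased estimator $1 - \binom{n-c}{k}/\binom{n}{k}$ for $n$ sampled rollouts of which $c$ are correct. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of sampled rollouts
    c: number of those rollouts judged correct
    k: budget of attempts (1 <= k <= n)
    Returns the probability that at least one of k rollouts drawn
    without replacement from the n samples is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k rollouts are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score would then be the mean of `pass_at_k` over all problems, again under the assumption that this estimator matches the paper's evaluation protocol.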

Theorems & Definitions (12)

  • Theorem 3.1: Diversity-tilted policy gradient induces DSDR global-to-local softmax coupling
  • Lemma 3.1: Inter-/intra-mode entropy decomposition
  • proof
  • Proposition 3.2: Correctness preservation under bounded $\lambda_\ell$
  • proof
  • Lemma 3.3: Probability of a mixed verifier-reward group
  • proof
  • Proposition 3.4: Non-vanishing GRPO signal under correct-only diversity bonus
  • proof
  • Proposition 3.5: Softmax allocation optimality
  • ...and 2 more
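
Lemma 3.3 concerns the probability that a rollout group contains both correct and incorrect responses, which is the case in which group-based optimization receives a non-degenerate reward signal. The lemma's exact statement is not reproduced here, so the sketch below is only an illustration under a simplifying assumption: the rollouts in a group are independent and each is correct with the same probability `p`. Under that assumption the probability of a mixed group of size `group_size` is $1 - p^{G} - (1-p)^{G}$.

```python
def mixed_group_probability(p, group_size):
    """Probability that a group has at least one correct and one incorrect rollout.

    Assumes independent rollouts with identical success probability p
    (an illustrative assumption, not taken from the paper).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    all_correct = p ** group_size
    all_incorrect = (1.0 - p) ** group_size
    return 1.0 - all_correct - all_incorrect
```

The value drops to zero as `p` approaches 0 or 1, which matches the intuition that groups with uniform verifier rewards carry no relative signal.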