Diversity-Incentivized Exploration for Versatile Reasoning

Zican Hu, Shilin Zhang, Yafu Li, Jianhao Yan, Xuyang Hu, Leyang Cui, Xiaoye Qu, Chunlin Chen, Yu Cheng, Zhi Wang

TL;DR

The paper introduces DIVER, a Diversity-Incentivized Exploration framework for Reinforcement Learning with Verifiable Rewards (RLVR) in LLM reasoning, which uses global sequence-level diversity to drive deep exploration. It defines Textual Diversity and Equational Diversity as semantically structured metrics and uses a potential-based intrinsic reward $R_{\text{int}}$ to shape learning while preserving optimal policy invariance. Conditional shaping and clipping mitigate reward hacking, balancing correctness against diverse reasoning paths. Across six math benchmarks and cross-domain tasks, DIVER outperforms strong RLVR baselines and generalizes across models, with notable improvements in Pass@k that indicate a broader reasoning scope. The work suggests that optimizing global diversity can advance versatile reasoning in LLMs and points to multi-turn RLVR and richer diversity measures as future directions.
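The conditional shaping and clipping described above can be illustrated with a minimal sketch. Everything named here is hypothetical: `jaccard_distance` is a simple stand-in for the paper's Textual and Equational Diversity metrics (which are not defined in this summary), and `weight` and `clip` are illustrative values, not the authors' settings.

```python
from typing import List


def jaccard_distance(a: str, b: str) -> float:
    """Stand-in diversity measure (assumption): 1 - Jaccard overlap of token sets."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def diversity_scores(responses: List[str]) -> List[float]:
    """Mean distance from each response to the other rollouts in its group."""
    n = len(responses)
    if n < 2:
        return [0.0] * n
    return [
        sum(jaccard_distance(r, other)
            for j, other in enumerate(responses) if j != i) / (n - 1)
        for i, r in enumerate(responses)
    ]


def shaped_rewards(responses: List[str], correct: List[bool],
                   weight: float = 0.1, clip: float = 0.05) -> List[float]:
    """Verifiable 0/1 reward plus a clipped diversity bonus for correct rollouts only."""
    rewards = []
    for score, ok in zip(diversity_scores(responses), correct):
        extrinsic = 1.0 if ok else 0.0
        # Conditional shaping: incorrect rollouts receive no diversity bonus.
        # Clipping: the bonus can never exceed `clip`, so it stays small
        # relative to the correctness reward.
        bonus = min(weight * score, clip) if ok else 0.0
        rewards.append(extrinsic + bonus)
    return rewards


group = [
    "x = 2 so x + 1 = 3",
    "x = 2 so x + 1 = 3",
    "factor then substitute to get 3",
]
print(shaped_rewards(group, correct=[True, True, False]))
```

In this sketch an incorrect rollout earns nothing however unusual it is, which is the intuition behind applying the incentive conditionally: diversity alone cannot be farmed for reward.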

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a crucial paradigm for incentivizing reasoning capabilities in Large Language Models (LLMs). Due to vast state-action spaces and reward sparsity in reasoning tasks, existing methods often struggle with deficient exploration and poor sample efficiency. In this paper, we propose DIVER (Diversity-Incentivized Exploration for VersatilE Reasoning), an innovative framework that highlights the pivotal role of global sequence-level diversity in incentivizing deep exploration for versatile reasoning. We first conduct a primary empirical study to reveal a strong positive correlation between global diversity and reasoning capacity. Building on this insight, we introduce global diversity incentives as an intrinsic reward to promote deep exploration in a semantically structured space. Incorporating the intrinsic reward, we develop a potential-based reward shaping mechanism to preserve optimal policy invariance and design simple heuristics to mitigate possible reward hacking. Experimental results show that DIVER outperforms competitive RLVR baselines with various exploration strategies on both in-domain and out-of-domain tasks, excelling in both Pass@1 and Pass@k evaluations. Our code is available at https://github.com/NJU-RL/DIVER.
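For readers unfamiliar with the Pass@k metric mentioned above, the sketch below shows the commonly used unbiased estimator introduced with the Codex evaluation (Chen et al., 2021). This summary does not state which estimator the DIVER evaluation uses, so treat the choice as an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 3 correct answers out of 10 samples
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

With k = 1 the estimator reduces to the fraction of correct samples, while larger k rewards a model for solving a problem along at least one of several sampled paths.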

Paper Structure

This paper contains 41 sections, 2 theorems, 60 equations, 10 figures, 4 tables.

Key Result

Theorem 1

Let $M\!=\!(S,A,T,R,\gamma)$ denote the MDP for the LLM reasoning task. $d(\cdot)\!:S\mapsto \mathbb{R}$ is a real-valued function that computes the sequence-level diversity $d(s)$ of the state $s$ within a group of rollouts. We formulate $R_{\text{int}}(\cdot)\!:S\!\times\! A\!\times\! S\mapsto \ma
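The theorem statement above is cut off in this extract and is left as-is. For orientation only, the classical potential-based shaping form (Ng, Harada and Russell, 1999) that invariance results of this kind build on is sketched below. The symbol $\Phi$ is a generic potential; reading the diversity function $d$ as that potential is an assumption suggested by the setup, not something confirmed by the truncated text.

```latex
% Classical potential-based shaping reward for a potential \Phi : S \to \mathbb{R}
F(s, a, s') = \gamma\,\Phi(s') - \Phi(s)

% Along a trajectory s_0, \dots, s_T the discounted shaping terms telescope
\sum_{t=0}^{T-1} \gamma^{t} F(s_t, a_t, s_{t+1}) = \gamma^{T}\,\Phi(s_T) - \Phi(s_0)
```

Because the accumulated shaping reward depends only on the endpoints of the trajectory, it shifts the return by a quantity that the choice of actions cannot influence (when the terminal potential vanishes or is discounted away), which is why the optimal policy is preserved in the classical setting.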

Figures (10)

  • Figure 1: Local token-level vs. Global sequence-level exploration. We incentivize deep exploration to broaden diverse pathways for versatile reasoning.
  • Figure 2: Overview of DIVER where we formulate the global sequence-level diversity of response $o_i$ within a group of $G$ rollouts as an intrinsic reward $r_i^{\text{int}}$ to incentivize deep exploration. Diversity incentives are applied to correct solutions only to align shaping rewards with the true objective.
  • Figure 3: Performance comparison between high-diversity (red) and low-diversity (blue) training. solve all: Number of samples with all rollouts correctly solved. solve none: Samples with no correct rollouts. in-domain: Average test scores across training steps for in-domain benchmarks. out-of-domain: Final performance for out-of-domain benchmarks.
  • Figure 4: Training dynamics comparison with other exploration methods across different metrics. $\uparrow$ indicates metrics where higher values are more diverse for ED and TD.
  • Figure 5: Average scores across in-domain and out-of-domain tasks for DIVER with different models. Complete results in Table \ref{tab:main_other_models}.
  • ...and 5 more figures

Theorems & Definitions (3)

  • Theorem 1: Optimal Policy Invariance
  • Theorem 1: Optimal Policy Invariance
  • proof