Diversity-Incentivized Exploration for Versatile Reasoning

Zican Hu; Shilin Zhang; Yafu Li; Jianhao Yan; Xuyang Hu; Leyang Cui; Xiaoye Qu; Chunlin Chen; Yu Cheng; Zhi Wang

TL;DR

The paper introduces DIVER, a diversity-incentivized exploration framework for Reinforcement Learning with Verifiable Rewards (RLVR) in LLM reasoning. DIVER uses global sequence-level diversity to drive deep exploration. It defines two semantically structured metrics, Textual Diversity and Equational Diversity, and turns them into a potential-based intrinsic reward R_int that shapes learning while preserving optimal policy invariance. Conditional shaping and clipping mitigate reward hacking, so the policy can balance correctness against diverse reasoning paths. Across six math benchmarks and cross-domain tasks, DIVER outperforms strong RLVR baselines and generalizes across models, with notable gains in Pass@k that indicate a broader reasoning scope. The results suggest that optimizing global diversity can advance versatile reasoning in LLMs, and they point to multi-turn RLVR and richer diversity measures as future work.
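
As background for the policy-invariance claim, the block below writes out the classical form of potential-based reward shaping. It is a sketch under assumptions: the potential \Phi, the discount \gamma, and the state notation s_t are generic placeholders, and this summary does not state the exact form DIVER uses for R_int or for its conditional shaping and clipping.

```latex
% Classical potential-based shaping (generic form, not necessarily DIVER's):
\[
  F(s_t, s_{t+1}) = \gamma\,\Phi(s_{t+1}) - \Phi(s_t),
  \qquad
  r'_t = r_t + F(s_t, s_{t+1}).
\]
% The shaping terms telescope along any trajectory of length T:
\[
  \sum_{t=0}^{T-1} \gamma^{t} F(s_t, s_{t+1})
  = \gamma^{T}\,\Phi(s_T) - \Phi(s_0).
\]
% If \Phi(s_T) = 0 at terminal states, the shaped return differs from the
% original return only by -\Phi(s_0), which does not depend on the actions
% taken, so the ranking of policies is unchanged.
```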

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a crucial paradigm for incentivizing reasoning capabilities in Large Language Models (LLMs). Because reasoning tasks have vast state-action spaces and sparse rewards, existing methods often suffer from deficient exploration and poor sample efficiency. In this paper, we propose DIVER (Diversity-Incentivized Exploration for VersatilE Reasoning), a framework that highlights the pivotal role of global sequence-level diversity in incentivizing deep exploration for versatile reasoning. We first conduct a preliminary empirical study that reveals a strong positive correlation between global diversity and reasoning capacity. Building on this insight, we introduce global diversity incentives as an intrinsic reward to promote deep exploration in a semantically structured space. To incorporate the intrinsic reward, we develop a potential-based reward shaping mechanism that preserves optimal policy invariance, and we design simple heuristics to mitigate possible reward hacking. Experimental results show that DIVER outperforms competitive RLVR baselines with various exploration strategies on both in-domain and out-of-domain tasks, excelling in both Pass@1 and Pass@k evaluations. Our code is available at https://github.com/NJU-RL/DIVER.
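
The abstract reports Pass@1 and Pass@k but this page does not say how they are computed. As a minimal sketch, the snippet below implements the widely used unbiased Pass@k estimator, 1 - C(n-c, k) / C(n, k), for a problem with n sampled responses of which c are correct. The function name `pass_at_k` and its arguments are illustrative assumptions and are not taken from the DIVER codebase.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled responses, c of which are correct.

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 whenever
    # every size-k subset must contain at least one correct response.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 4 samples, 1 correct: half of all size-2 subsets contain the correct one.
    print(pass_at_k(n=4, c=1, k=2))
```

With k = 1 the estimator reduces to the fraction of correct samples, c / n, which is why Pass@1 measures average accuracy while larger k reflects how broadly the sampled reasoning paths cover a correct solution.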
