Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking

Feng Ju; Zeyu Qin; Rui Min; Zhitao He; Lingpeng Kong; Yi R. Fung

TL;DR

This work tackles the diversity bottleneck in large language model (LLM) reasoning by proposing a "one problem, multiple solutions" (1PNS) training paradigm. It introduces Reasoning Path Divergence (RPD), a step-level metric that captures semantic differences between long chain-of-thought solutions by summarizing each solution's steps and matching them asymmetrically. Using RPD, the authors curate a diverse training set from OpenThought3 and fine-tune Qwen3-4B-Base, obtaining consistent pass@k improvements over a "one problem, one solution" (1P1S) baseline across math benchmarks: +2.80% on average for pass@16 and +4.99% on AIME24. The results indicate that diversity-driven data curation complements Test-Time Scaling and improves reasoning performance, and they may inform the design of more interpretable and versatile LLM reasoning strategies.
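To make "summarizing steps and performing asymmetric matching" concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each solution has already been reduced to a list of short step summaries, and it substitutes a toy word-overlap distance for whatever semantic distance RPD actually uses. The function names `step_distance`, `divergence`, and `select_diverse`, as well as the greedy max-min selection rule, are hypothetical choices made for illustration.

```python
from itertools import combinations


def step_distance(a: str, b: str) -> float:
    """Toy stand-in for a semantic distance: 1 - Jaccard overlap of word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 0.0
    return 1.0 - len(wa & wb) / len(wa | wb)


def divergence(steps_a: list[str], steps_b: list[str]) -> float:
    """Asymmetric score: mean distance from each step of A to its closest step in B.

    It is 0 when every step of A also appears in B, even if B contains extra
    steps, so divergence(A, B) generally differs from divergence(B, A).
    """
    if not steps_a:
        return 0.0
    if not steps_b:
        return 1.0
    nearest = (min(step_distance(s, t) for t in steps_b) for s in steps_a)
    return sum(nearest) / len(steps_a)


def select_diverse(solutions: list[list[str]], n: int) -> list[int]:
    """Greedy max-min selection (an assumed heuristic) of n solution indices."""
    if n <= 0:
        return []
    if n >= len(solutions):
        return list(range(len(solutions)))
    if n == 1:
        return [0]

    def sym(i: int, j: int) -> float:
        # Average both directions so that selection does not depend on order.
        return 0.5 * (divergence(solutions[i], solutions[j])
                      + divergence(solutions[j], solutions[i]))

    # Seed with the most divergent pair, then repeatedly add the solution
    # that is farthest from its nearest already-chosen neighbour.
    chosen = list(max(combinations(range(len(solutions)), 2),
                      key=lambda pair: sym(*pair)))
    while len(chosen) < n:
        rest = [i for i in range(len(solutions)) if i not in chosen]
        chosen.append(max(rest, key=lambda i: min(sym(i, j) for j in chosen)))
    return chosen


solutions = [
    ["factor the quadratic", "solve each linear factor"],
    ["factor the quadratic", "solve each linear factor", "check the roots"],
    ["apply the quadratic formula", "simplify the discriminant"],
]
picked = select_diverse(solutions, 2)
```

In this toy input the first solution is a strict subset of the second, so the divergence from the first to the second is zero while the reverse direction is positive; that is the asymmetry the sketch is meant to show. A real pipeline would replace `step_distance` with a semantic measure over the step summaries.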

Abstract

While Test-Time Scaling (TTS) has proven effective in improving the reasoning ability of large language models (LLMs), low diversity in model outputs often becomes a bottleneck; this is partly caused by the common "one problem, one solution" (1P1S) training practice, which provides a single canonical answer and can push models toward a narrow set of reasoning paths. To address this, we propose a "one problem, multiple solutions" (1PNS) training paradigm that exposes the model to a variety of valid reasoning trajectories and thus increases inference diversity. A core challenge for 1PNS is reliably measuring semantic differences between multi-step chains of thought, so we introduce Reasoning Path Divergence (RPD), a step-level metric that aligns and scores Long Chain-of-Thought solutions to capture differences in intermediate reasoning. Using RPD, we curate maximally diverse solution sets per problem and fine-tune Qwen3-4B-Base. Experiments show that RPD-selected training yields more varied outputs and higher pass@k, with an average +2.80% gain in pass@16 over a strong 1P1S baseline and a +4.99% gain on AIME24, demonstrating that 1PNS further amplifies the effectiveness of TTS. Our code is available at https://github.com/fengjujf/Reasoning-Path-Divergence .
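Since the headline results are reported as pass@k, a short sketch of how that quantity is commonly estimated may help. The code below uses the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n sampled solutions of which c are correct. Whether the authors use exactly this estimator is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: number of solutions sampled for the problem
    c: number of those solutions that are correct
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Per-problem values are averaged over a benchmark to give its pass@k.
per_problem = [pass_at_k(16, c, 16) for c in (0, 1, 5)]
benchmark_pass_at_16 = sum(per_problem) / len(per_problem)
```

Under this definition, a model whose samples all follow the same reasoning path gains little from a larger k, which is why output diversity matters for pass@16.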
