Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking

Feng Ju, Zeyu Qin, Rui Min, Zhitao He, Lingpeng Kong, Yi R. Fung

TL;DR

This work tackles the diversity bottleneck in large language model reasoning by proposing a one problem, multiple solutions (1PNS) training paradigm. It introduces Reasoning Path Divergence (RPD), a step-level metric that captures semantic differences between long chain-of-thought solutions by summarizing steps and performing asymmetric matching. Using RPD, the authors curate a diverse training set from OpenThought3 and fine-tune Qwen3-4B-Base, achieving consistent improvements in pass@k across math benchmarks, notably +2.80% on average for pass@16 and +4.99% on AIME24. The results demonstrate that diversity-driven data curation complements Test-Time Scaling and can significantly boost reasoning performance, with broader implications for designing more interpretable and versatile LLM reasoning strategies.

Abstract

While Test-Time Scaling (TTS) has proven effective in improving the reasoning ability of large language models (LLMs), low diversity in model outputs often becomes a bottleneck; this is partly caused by the common "one problem, one solution" (1P1S) training practice, which provides a single canonical answer and can push models toward a narrow set of reasoning paths. To address this, we propose a "one problem, multiple solutions" (1PNS) training paradigm that exposes the model to a variety of valid reasoning trajectories and thus increases inference diversity. A core challenge for 1PNS is reliably measuring semantic differences between multi-step chains of thought, so we introduce Reasoning Path Divergence (RPD), a step-level metric that aligns and scores Long Chain-of-Thought solutions to capture differences in intermediate reasoning. Using RPD, we curate maximally diverse solution sets per problem and fine-tune Qwen3-4B-Base. Experiments show that RPD-selected training yields more varied outputs and higher pass@k, with an average +2.80% gain in pass@16 over a strong 1P1S baseline and a +4.99% gain on AIME24, demonstrating that 1PNS further amplifies the effectiveness of TTS. Our code is available at https://github.com/fengjujf/Reasoning-Path-Divergence .
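The reported gains are stated in terms of pass@k. The abstract does not say how pass@k is estimated, so the sketch below assumes the widely used unbiased estimator: with n sampled solutions per problem, of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` and the example counts are illustrative, not taken from the paper or its repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled solutions, c: number of correct ones,
    k: sampling budget being evaluated (k <= n).
    Assumption: this is the standard combinatorial estimator; the
    paper's own evaluation code may differ.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples are wrong, every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 16 samples for one problem, 2 of them correct.
for k in (1, 2, 4, 8, 16):
    print(k, round(pass_at_k(16, 2, k), 4))
```

Averaging this quantity over all problems in a benchmark gives the benchmark-level pass@k. More diverse outputs help mainly at larger k, where a single distinct correct path among the samples is enough.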

Paper Structure

This paper contains 64 sections, 5 equations, 4 figures, 16 tables, 3 algorithms.

Figures (4)

  • Figure 1: The workflow of our Reasoning Path Divergence (RPD) metric. Given two solutions (A and B), an LLM first decomposes them into step-level summaries. An asymmetric matching is then performed: each step in the shorter summary (A) is matched to its semantically closest counterpart in the longer summary (B) based on embedding cosine distance. The final RPD score is the average of these minimum distances. Detailed examples with analysis are provided in the appendix (RPD case studies).
  • Figure 2: Performance comparison of our 1P3S approach against the 1P1S baseline across three mathematical reasoning benchmarks. Each subplot corresponds to a different benchmark, showing the pass@k accuracy for k=1, 2, 4, 8, 16.
  • Figure 3: Distribution of pairwise diversity scores on 100 problems for the baseline (left) and our RPD metric (right). RPD provides a significantly better-separated distribution.
  • Figure 4: PCA visualization of raw solution and step summary embeddings. The step embeddings for the two solutions occupy distinct regions of the space, reflecting a strategic diversity that our RPD metric correctly identifies. In contrast, the raw solution embeddings are nearly collinear, causing the baseline method to fail to distinguish them.
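
The matching step described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the step summaries have already been produced by an LLM and embedded by some sentence-embedding model, and it takes those embeddings as plain NumPy arrays. The function name `rpd_score` and the toy vectors are hypothetical.

```python
import numpy as np


def rpd_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Average minimum cosine distance from the shorter solution to the longer.

    emb_a, emb_b: arrays of shape (n_steps, dim), one row per step summary.
    Assumption: rows are non-zero embedding vectors from the same model.
    """
    # Asymmetric matching: iterate over the solution with fewer steps.
    if len(emb_a) <= len(emb_b):
        short, long_ = emb_a, emb_b
    else:
        short, long_ = emb_b, emb_a
    # Normalize rows so that the dot product equals cosine similarity.
    short = short / np.linalg.norm(short, axis=1, keepdims=True)
    long_ = long_ / np.linalg.norm(long_, axis=1, keepdims=True)
    # dist[i, j] = cosine distance between short step i and long step j.
    dist = 1.0 - short @ long_.T
    # Each short step keeps its closest counterpart; average those minima.
    return float(dist.min(axis=1).mean())


# Toy 2-D "embeddings" standing in for real step-summary embeddings.
solution_a = np.array([[1.0, 0.0], [0.0, 1.0]])
solution_b = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
solution_c = np.array([[-1.0, 0.0], [-1.0, -1.0], [0.0, -1.0]])

print(rpd_score(solution_a, solution_b))  # every step of A has an exact match in B
print(rpd_score(solution_a, solution_c))  # no step of A points in a direction shared with C
```

Matching from the shorter summary means extra steps in the longer solution do not inflate the score on their own; the score is high only when the shorter solution's steps lack close counterparts.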