Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths

Yew Ken Chia, Guizhen Chen, Weiwen Xu, Luu Anh Tuan, Soujanya Poria, Lidong Bing

TL;DR

This work introduces Reasoning Paths Optimization (RPO), a specialized training framework that teaches language models to reason and explore from diverse reasoning paths, significantly enhancing their reasoning performance.

Abstract

Advanced models such as OpenAI o1 exhibit impressive problem-solving capabilities through step-by-step reasoning. However, they may still falter on more complex problems, making errors that disrupt their reasoning paths. We attribute this to the expansive solution space, where each step has the risk of diverging into mistakes. To enhance language model reasoning, we introduce a specialized training framework called Reasoning Paths Optimization (RPO), which enables learning to reason and explore from diverse paths. Our approach encourages favorable branches at each reasoning step while penalizing unfavorable ones, enhancing the model's overall problem-solving performance. Reasoning Paths Optimization does not rely on large-scale human-annotated rationales or outputs from closed-source models, making it scalable and data-efficient. We focus on multi-step reasoning tasks, such as math word problems and science-based exam questions. The experiments demonstrate that our framework significantly enhances the reasoning performance of large language models, with up to 3.1% and 4.3% improvement on GSM8K and MMLU (STEM) respectively. Our data and code can be found at https://reasoning-paths.github.io.

Paper Structure

This paper contains 33 sections, 10 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: An example of how the reasoning path of the model can easily diverge to unfavorable branches that fail to reach the correct solution. While we show a simplified example here, the challenge is amplified for more complex questions that require longer reasoning paths.
  • Figure 2: An overview of our Reasoning Paths Optimization framework for exploring and learning over diverse reasoning paths.
  • Figure 3: Main results showing the evaluation accuracy (%) of different training methods on math reasoning questions in GSM8K and MATH, and science-based exam questions in MMLU-STEM. We also indicate the improvement of our method compared to the highest-performing baseline.
  • Figure 4: The effect of exploration loss weight on the MATH dataset performance for LLaMA-3-8B.
  • Figure 5: Performance with respect to reasoning path length on the MATH dataset for LLaMA-3-8B.
  • ...and 2 more figures