Towards Stepwise Domain Knowledge-Driven Reasoning Optimization and Reflection Improvement

Chengyuan Liu, Shihang Wang, Lizhi Qing, Kaisong Song, Junjie Cao, Jun Lin, Ji Zhang, Ang Li, Kun Kuang, Fei Wu

TL;DR

This paper tackles knowledge-intensive reasoning in the legal domain by introducing SKROP, a stepwise domain knowledge-driven reasoning framework that uses MCTS with XML-tagged thoughts to generate step-level supervision. It further introduces PORP to optimize the quality of reflections, guiding self-reflection when missteps occur. Through extensive experiments on legal datasets, SKROP and PORP achieve improved accuracy over baselines, with stronger gains on more capable base models, illustrating the value of automatic, domain-specific supervision. The approach advances domain-specific LLM reasoning by combining structured stepwise supervision, diverse exploration, and reflection-guided learning, offering a cost-effective path to knowledge-grounded reasoning in specialized fields.

Abstract

Recently, stepwise supervision on Chains of Thought (CoTs), obtained with the help of Monte Carlo Tree Search (MCTS), has improved performance on logical reasoning tasks such as coding and math. However, its contribution to tasks requiring domain-specific expertise and knowledge remains unexplored. Motivated by this gap, we identify several potential challenges of vanilla MCTS in this context and propose Stepwise Domain Knowledge-Driven Reasoning Optimization (SKROP), a framework that employs MCTS to develop step-level supervision for problems that require comprehension, reasoning, and specialized knowledge. We also introduce Preference Optimization towards Reflection Paths (PORP), which iteratively learns self-reflection on reasoning thoughts from better perspectives. Extensive experiments show that both methods are effective on a range of legal-domain problems. We also report a diverse set of findings, which we hope will encourage further research on domain-specific LLMs and MCTS.
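
To make the idea of step-level supervision from a search tree more concrete, the following is a minimal sketch, not the paper's implementation. It assumes each tree node stores a mean rollout reward and that sibling steps whose values differ by at least a margin form a (chosen, rejected) pair under their shared prefix. The names `Node`, `backpropagate`, `step_pairs`, and the `margin` parameter are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One reasoning step in a search tree (hypothetical structure)."""
    step: str
    visits: int = 0
    total_reward: float = 0.0
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add_child(self, step: str) -> "Node":
        child = Node(step=step, parent=self)
        self.children.append(child)
        return child

    @property
    def value(self) -> float:
        # Mean rollout reward; 0.0 for nodes that were never visited.
        return self.total_reward / self.visits if self.visits else 0.0


def backpropagate(leaf: Node, reward: float) -> None:
    """Add one rollout reward to the leaf and to all of its ancestors."""
    node: Optional[Node] = leaf
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent


def prefix_of(node: Node) -> List[str]:
    """Steps from the root down to and including `node`."""
    steps: List[str] = []
    current: Optional[Node] = node
    while current is not None:
        steps.append(current.step)
        current = current.parent
    return steps[::-1]


def step_pairs(root: Node, margin: float = 0.5) -> List[Tuple[List[str], str, str]]:
    """Collect (shared prefix, chosen step, rejected step) from sibling nodes.

    Assumption: a pair is kept only when the best and worst visited
    siblings differ in value by at least `margin`.
    """
    pairs = []
    stack = [root]
    while stack:
        node = stack.pop()
        visited = [c for c in node.children if c.visits > 0]
        if len(visited) >= 2:
            best = max(visited, key=lambda c: c.value)
            worst = min(visited, key=lambda c: c.value)
            if best.value - worst.value >= margin:
                pairs.append((prefix_of(node), best.step, worst.step))
        stack.extend(node.children)
    return pairs


# Toy usage: the rewards are made up and stand in for rollout outcomes.
root = Node("Question with four answer options")
good = root.add_child("Identify the governing statute")
bad = root.add_child("Guess from the option wording")
backpropagate(good, 1.0)
backpropagate(good, 1.0)
backpropagate(bad, 0.0)
pairs = step_pairs(root)
```

In this toy tree the two sibling steps share the question as their prefix, so `pairs` holds a single preference example. A real pipeline would replace the hand-set rewards with outcomes of simulated rollouts and feed the pairs to a preference-optimization objective.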

Paper Structure

This paper contains 38 sections, 14 equations, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: Illustration of a question in legal examination.
  • Figure 2: Framework of SKROP. SKROP builds a tree whose root node consists of a question and its corresponding options. The chosen (green smiling face) and rejected (red crying face) trajectories are sampled together with their preceding steps to train the policy model and the value head.
  • Figure 3: PORP optimizes the preference of reflection paths. The dotted lines are built for reflection training. We highlight the reflection paths in red.
  • Figure 4: Performance of step-level and solution-level supervision.
  • Figure 5: Accuracy scores at different rounds.
  • ...and 4 more figures