Advancing Process Verification for Large Language Models via Tree-Based Preference Learning

Mingqian He, Yongliang Shen, Wenqi Zhang, Zeqi Tan, Weiming Lu

TL;DR

This work addresses the limitations of binary verifiers in evaluating stepwise reasoning from large language models. It introduces Tree-PLV, a tree-based verifier trained with step-level preferences in a best-first reasoning-tree framework, using look-ahead completions to compute step rewards and a pairwise ranking loss to train with high-resolution feedback. Empirical results across GSM8K, MATH500, CSQA, and StrategyQA show substantial gains over strong baselines and demonstrate good generalization across different generators, with data-efficient training. The approach offers finer-grained, more robust verification of reasoning paths and highlights the value of step-level guidance for improving the reliability of LLM reasoning in complex tasks.

Abstract

Large Language Models (LLMs) have demonstrated remarkable potential in handling complex reasoning tasks by generating step-by-step rationales. Some methods have proven effective in boosting accuracy by introducing extra verifiers to assess these paths. However, existing verifiers, typically trained on binary-labeled reasoning paths, fail to fully utilize the relative merits of intermediate steps, thereby limiting the effectiveness of the feedback provided. To overcome this limitation, we propose the Tree-based Preference Learning Verifier (Tree-PLV), a novel approach that constructs reasoning trees via a best-first search algorithm and collects step-level paired data for preference training. Compared to traditional binary classification, step-level preferences more finely capture the nuances between reasoning steps, allowing for a more precise evaluation of the complete reasoning path. We empirically evaluate Tree-PLV across a range of arithmetic and commonsense reasoning tasks, where it significantly outperforms existing benchmarks. For instance, Tree-PLV achieved substantial performance gains over the Mistral-7B self-consistency baseline on GSM8K (67.55% to 82.79%), MATH (17.00% to 26.80%), CSQA (68.14% to 72.97%), and StrategyQA (82.86% to 83.25%). Additionally, our study explores the appropriate granularity for applying preference learning, revealing that step-level guidance provides feedback that better aligns with the evaluation of the reasoning process.
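To make the contrast with binary classification concrete, the following is a minimal Python sketch of a pairwise ranking loss over step-level preference pairs. The exact loss used by Tree-PLV is not given in this summary, so the logistic (Bradley-Terry style) form below is an assumption, and the names `pairwise_ranking_loss` and `mean_pairwise_loss` are hypothetical.

```python
import math
from typing import Sequence, Tuple


def pairwise_ranking_loss(score_preferred: float, score_rejected: float) -> float:
    """Logistic pairwise loss: -log(sigmoid(score_preferred - score_rejected)).

    ASSUMPTION: a Bradley-Terry style form; the paper's exact loss is not
    reproduced in this summary.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_pairwise_loss(pairs: Sequence[Tuple[float, float]]) -> float:
    """Average the loss over (preferred_score, rejected_score) pairs."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    return sum(pairwise_ranking_loss(p, r) for p, r in pairs) / len(pairs)


if __name__ == "__main__":
    # Verifier scores for two sibling steps; the first is the preferred one.
    print(pairwise_ranking_loss(2.0, 0.0))  # small loss: ranking is correct
    print(pairwise_ranking_loss(0.0, 2.0))  # large loss: ranking is inverted
```

Unlike a binary label, the loss depends only on the difference between the two scores, so the verifier is trained to rank one step above its alternative rather than to match an absolute target.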

Paper Structure (36 sections, 2 equations, 10 figures, 3 tables)

This paper contains 36 sections, 2 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: A comparison of different methods: Traditional verifiers rely on binary labels for outcome and process supervision, whereas Tree-PLV employs preferences instead of scalar values.
  • Figure 2: The construction process of the reasoning tree. Best-first search consistently selects the child node with highest reward for further expansion. To evaluate the quality of the $i$-th step, we sample $N$ completions from it, denoted as $\mathcal{P}_i$. The reward is then calculated based on the proportion of these $N$ paths that yield the correct answer.
  • Figure 3: Performance of different verifiers across varying numbers of solutions (N) generated by Mistral-7B.
  • Figure 4: A performance comparison of verifiers trained with different levels of feedback granularity.
  • Figure 5: Performance comparison of MCTS and Tree-PLV across different generators on GSM8K.
  • ...and 5 more figures
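
The step reward described in the Figure 2 caption (the proportion of the $N$ sampled completions that reach the correct answer) and the best-first choice of the highest-reward child can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the helper `extract_answer` and its "Answer:" convention, and the function names `step_reward` and `select_best_child`, are assumptions introduced here.

```python
from typing import Callable, Sequence


def extract_answer(completion: str) -> str:
    """HYPOTHETICAL parser: take whatever follows the last 'Answer:' marker."""
    return completion.rsplit("Answer:", 1)[-1].strip()


def step_reward(
    completions: Sequence[str],
    gold_answer: str,
    parse: Callable[[str], str] = extract_answer,
) -> float:
    """Fraction of look-ahead completions from a step that reach gold_answer."""
    if not completions:
        return 0.0  # assumption: no completions means no evidence of quality
    correct = sum(1 for c in completions if parse(c) == gold_answer)
    return correct / len(completions)


def select_best_child(rewards: Sequence[float]) -> int:
    """Best-first selection: index of the child step with the highest reward."""
    if not rewards:
        raise ValueError("no child steps to choose from")
    return max(range(len(rewards)), key=lambda i: rewards[i])


if __name__ == "__main__":
    # N = 4 completions sampled from one candidate step; gold answer is "4".
    sampled = ["2 + 2 = 4. Answer: 4", "Answer: 5", "So Answer: 4", "Answer: 4"]
    print(step_reward(sampled, "4"))             # 3 of 4 completions are correct
    print(select_best_child([0.25, 0.75, 0.5]))  # index of the child to expand
```

Because each reward is a fraction in [0, 1] rather than a binary label, sibling steps can be compared by their reward, which is what makes step-level preference pairs available for training.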