Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification

Zhenwen Liang, Ye Liu, Tong Niu, Xiangliang Zhang, Yingbo Zhou, Semih Yavuz

TL;DR

This work tackles the unreliability of LLMs in multi-step reasoning by scaling inference-time computation: it samples multiple reasoning paths and applies verifiers to rank them. It introduces the Math-Rev and Code-Rev verifiers, trained with the reference-free SimPO framework on a large, diverse dataset of correct and incorrect solutions for math and code tasks, so that outputs can be scored and ranked robustly. A novel CoTnPoT approach combines readable chain-of-thought with executable program-of-thought to strengthen the verification signal. With Qwen-72B-Instruct as the reasoner, the verifiers achieve state-of-the-art results on GSM8k and MATH and outperform GPT-4o. The method also generalizes well across reasoner models and offers a practical path to more reliable and scalable LLM reasoning.
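
For orientation, the SimPO objective as usually stated in its original formulation is a length-normalized preference loss that needs no reference model. The symbols below (policy $\pi_\theta$, preferred and rejected responses $y_w$ and $y_l$, scale $\beta$, target margin $\gamma$) follow that formulation and are not taken from this page; the hyperparameter values used for Math-Rev and Code-Rev are not given here.

```latex
\mathcal{L}_{\text{SimPO}}(\pi_\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x)
      - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x)
      - \gamma
    \right) \right]
```

In a verifier setting, $y_w$ would presumably be a correct solution and $y_l$ an incorrect solution to the same question $x$, and the length-normalized log-likelihood can then serve as the ranking score.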

Abstract

Despite significant advancements in the general capability of large language models (LLMs), they continue to struggle with consistent and accurate reasoning, especially in complex tasks such as mathematical and code reasoning. One key limitation is that LLMs are trained primarily on correct solutions, which reduces their ability to detect and learn from errors and, in turn, to reliably verify and rank outputs. To address this, we scale up inference-time computation by generating multiple reasoning paths and employing verifiers to assess and rank the generated outputs by correctness. To facilitate this, we introduce a comprehensive dataset of correct and incorrect solutions for math and code tasks, generated by multiple LLMs. This diverse set of solutions enables verifiers to more effectively distinguish correct answers from erroneous outputs and rank them accordingly. The training methods for building verifiers were selected based on an extensive comparison of existing approaches. Moreover, to leverage the unique strengths of different reasoning strategies, we propose a novel collaborative method that integrates Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification. CoT provides a clear, step-by-step reasoning process that enhances interpretability, while PoT, being executable, offers a precise and error-sensitive validation mechanism. By combining the strengths of both, our approach significantly improves the accuracy and reliability of reasoning verification. Our verifiers, Math-Rev and Code-Rev, deliver substantial performance gains over existing LLMs, achieving state-of-the-art results on benchmarks such as GSM8k and MATH and even outperforming GPT-4o with Qwen-72B-Instruct as the reasoner.

Paper Structure (28 sections, 3 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Comparison of greedy decoding accuracy and recall out of 64 sampled solutions on GSM8k dataset with various LLMs.
  • Figure 2: Accuracy on the MATH test set across models.
  • Figure 3: The workflow of our method. We first sample solutions from multiple LLM reasoners and then train verifiers using a preference loss (Step 1). During inference (Step 2), we sample multiple CoT solutions per question and use a coder LLM to transform each of them into PoT format. We then filter out any CoT solution whose answer does not match the result of its corresponding PoT program and feed the remaining CoT solutions to the verifier. The solution with the highest score is selected as the final answer. An example of CoT and PoT solutions is attached.
  • Figure 4: Performance of different verifiers (all better than greedy decoding).
  • Figure 5: Ablation study on CoTnPoT.
  • ...and 1 more figures
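
The inference step described in the Figure 3 caption (filter CoT solutions by agreement with their PoT results, then pick the highest-scoring survivor) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Candidate` class, the function names, the exact string comparison of answers, and the fallback used when every candidate is filtered out are all assumptions made for the example, and `verifier_score` stands in for a trained verifier such as Math-Rev.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Candidate:
    """One sampled solution (hypothetical container, not from the paper)."""
    cot_text: str               # full chain-of-thought solution
    cot_answer: str             # final answer extracted from the CoT
    pot_answer: Optional[str]   # result of running the translated program,
                                # or None if the program failed to execute


def cotnpot_filter(candidates: Sequence[Candidate]) -> List[Candidate]:
    """Keep only candidates whose CoT answer matches the executed PoT result."""
    return [
        c for c in candidates
        if c.pot_answer is not None
        and c.cot_answer.strip() == c.pot_answer.strip()
    ]


def select_best(
    candidates: Sequence[Candidate],
    verifier_score: Callable[[str], float],
    fallback_to_all: bool = True,
) -> Optional[Candidate]:
    """Filter with CoTnPoT, then return the candidate the verifier scores highest.

    Assumption: if the filter removes every candidate, fall back to ranking
    the unfiltered pool. The source does not say how this case is handled.
    """
    pool = cotnpot_filter(candidates)
    if not pool and fallback_to_all:
        pool = list(candidates)
    if not pool:
        return None
    return max(pool, key=lambda c: verifier_score(c.cot_text))


if __name__ == "__main__":
    # Toy scorer for demonstration only: longer text gets a higher score.
    toy_score = lambda text: float(len(text))
    samples = [
        Candidate("a", "12", "12"),
        Candidate("bbbb", "15", "12"),   # CoT and PoT disagree: filtered out
        Candidate("cc", "12", None),     # program failed: filtered out
        Candidate("ddd", "7", "7"),
    ]
    best = select_best(samples, toy_score)
    print(best.cot_answer if best else None)
```

Answer matching is done here by comparing stripped strings; a real pipeline would likely need numeric or symbolic normalization before comparing CoT and PoT answers.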