PSPO*: An Effective Process-supervised Policy Optimization for Reasoning Alignment

Jiawei Li, Xinyue Liang, Junlong Zhang, Yizhe Yang, Chong Feng, Yang Gao

TL;DR

PSPO-WRS considers the number of reasoning steps when determining reward scores, uses an adjusted Weibull distribution for nonlinear reward shaping, and consistently outperforms current mainstream models.

Abstract

Process supervision enhances the performance of large language models in reasoning tasks by providing feedback at each step of chain-of-thought reasoning. However, due to the lack of effective process supervision methods, even advanced large language models are prone to logical errors and redundant reasoning. We claim that the effectiveness of process supervision significantly depends on both the accuracy and the length of reasoning chains. Moreover, we identify that these factors exhibit a nonlinear relationship with the overall reward score of the reasoning process. Inspired by these insights, we propose a novel process supervision paradigm, PSPO*, which systematically outlines the workflow from reward model training to policy optimization, and highlights the importance of nonlinear rewards in process supervision. Based on PSPO*, we develop the PSPO-WRS, which considers the number of reasoning steps in determining reward scores and utilizes an adjusted Weibull distribution for nonlinear reward shaping. Experimental results on six mathematical reasoning datasets demonstrate that PSPO-WRS consistently outperforms current mainstream models.

Paper Structure

This paper contains 27 sections, 14 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: An example from the QQA dataset. The erroneous solution makes a mistake in step [2], where the model confuses a time period with a time point and so reaches a wrong answer. The incomplete solution jumps to the final answer right after summarizing the problem, which is incomplete and unreasonable. The redundant solution contains superfluous steps that generate too much noise.
  • Figure 2: The data annotation approach for PRM. Unlike ORM, the annotation approach of PRM cannot generate pairwise preference data, thus precluding the use of the Bradley-Terry method for training the reward model.
  • Figure 3: The overall method of PSPO*. The process supervision workflow incorporates a nonlinear accumulation function tied to the accuracy of the reasoning chains, and nonlinear reward shaping that adjusts rewards according to the length of the reasoning chains.
  • Figure 4: The adjusted Weibull distribution. Parameter settings are: $C=10.735$, $k=1.5$, and $\lambda=8.0$.
  • Figure 5: The results compared with ultra LLMs. It is noteworthy that our model outperforms ultra LLMs in most scenarios with only 7B parameters.
  • ...and 2 more figures
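
The parameter settings quoted for Figure 4 can be explored with a minimal sketch. It assumes that the "adjusted" form is the standard two-parameter Weibull density multiplied by the constant $C$, and that the input is the number of reasoning steps; neither assumption is confirmed by the text above, and the function names are hypothetical. Under that assumption, $C=10.735$ rescales the curve so that its peak is approximately 1.

```python
import math

# Parameter values quoted in the Figure 4 caption.
C, K, LAM = 10.735, 1.5, 8.0


def adjusted_weibull(x, c=C, k=K, lam=LAM):
    """Assumed form: c times the standard Weibull density at x.

    x is taken to be the number of reasoning steps (an assumption).
    Returns 0.0 for non-positive x, where the density is zero.
    """
    if x <= 0:
        return 0.0
    z = x / lam
    return c * (k / lam) * z ** (k - 1) * math.exp(-(z ** k))


def weibull_mode(k=K, lam=LAM):
    """Location of the density's maximum (valid for k > 1)."""
    return lam * ((k - 1) / k) ** (1 / k)


if __name__ == "__main__":
    peak_x = weibull_mode()
    print(f"peak near x = {peak_x:.3f}")
    for steps in (1, 2, 4, 8, 16):
        print(steps, round(adjusted_weibull(steps), 3))
```

With these values the curve rises to its maximum at roughly four steps and decays for longer chains, which matches the idea that both very short and very long reasoning chains receive a smaller shaping factor.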