Rectify Evaluation Preference: Improving LLMs' Critique on Math Reasoning via Perplexity-aware Reinforcement Learning

Changyuan Tian, Zhicong Lu, Shuang Qian, Nayu Liu, Peiguang Li, Li Jin, Leiyi Hu, Zhizhao Zeng, Sirui Wang, Ke Zeng, Zhi Guo

TL;DR

This work investigates why LLMs that critique multi-step mathematical reasoning (MsMR) are biased toward low-perplexity solutions. It builds a One-to-many Problem-Solution (OPS) benchmark that contrasts how a model evaluates its own solutions with how it evaluates solutions from other models. The analysis reveals a perplexity-linked imbalanced evaluation preference, and the authors propose a perplexity-aware variant of Group Relative Policy Optimization (GRPO) with perplexity-modulated advantages and class-level loss aggregation to balance exploration and learning. Results on OPS and ProcessBench across multiple base models show improved critiquing accuracy, reduced evaluation bias, and better cross-dataset generalization, with state-of-the-art performance reported on critic benchmarks. The approach targets scalable, reliable supervision for multi-step mathematical reasoning and error localization in LLMs.

Abstract

To improve the Multi-step Mathematical Reasoning (MsMR) of Large Language Models (LLMs), it is crucial to obtain scalable supervision from the corpus by automatically critiquing mistakes in the reasoning process and rendering a final verdict on each problem-solution pair. Most existing methods rely on crafting high-quality supervised fine-tuning demonstrations to enhance critiquing capability and pay little attention to the underlying reason for LLMs' poor critiquing performance. In this paper, we take an orthogonal direction: we quantify and investigate one potential reason, imbalanced evaluation preference, through a statistical preference analysis. Motivated by this analysis, we propose a novel perplexity-aware reinforcement learning algorithm that rectifies the evaluation preference and thereby improves critiquing capability. Specifically, to probe LLMs' critiquing characteristics, we construct a One-to-many Problem-Solution (OPS) benchmark that quantifies how an LLM's behavior differs when it evaluates solutions generated by itself versus by other models. Then, to investigate this difference in depth, we conduct a statistical preference analysis centered on perplexity and find an intriguing phenomenon: "LLMs tend to judge solutions with lower perplexity as correct", which we call imbalanced evaluation preference. To rectify this preference, we use perplexity as the guiding signal within Group Relative Policy Optimization (GRPO), encouraging the LLM to explore trajectories that judge lower-perplexity solutions as wrong and higher-perplexity solutions as correct. Extensive experiments on our OPS benchmark and existing critic benchmarks demonstrate the validity of our method.
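
The two quantities the abstract builds on, perplexity and the group-relative advantage of GRPO, can be illustrated with a minimal Python sketch. It shows only the standard definitions: perplexity as the exponential of the negative mean token log-probability, and the vanilla GRPO advantage as a reward standardized within one sampled group. The function names, the use of the population standard deviation, and the toy numbers are assumptions made for illustration. The paper's actual perplexity-based modulation of the advantage is not specified in this summary and is deliberately not reproduced here.

```python
import math
from statistics import mean, pstdev


def perplexity(token_logprobs):
    """Perplexity of a solution: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def group_relative_advantages(rewards, eps=1e-8):
    """Vanilla GRPO-style advantage: standardize rewards within one group.

    Assumption: population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy inputs (hypothetical values, not taken from the paper).
solution_logprobs = [-0.1, -0.3, -0.2]   # log-probs of the solution tokens
group_rewards = [1.0, 0.0, 0.0, 1.0]     # 1.0 = critique verdict matched the label

ppl = perplexity(solution_logprobs)
advantages = group_relative_advantages(group_rewards)
print(f"perplexity = {ppl:.4f}")
print("advantages =", [round(a, 3) for a in advantages])
```

A perplexity-aware variant would additionally rescale these advantages as a function of `ppl`, so that verdicts going against the low-perplexity-means-correct tendency are explored more often. The exact form of that rescaling should be taken from the paper itself.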

Paper Structure

This paper contains 21 sections, 8 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Visualization of the negative correlation between perplexity and BI for each critic model. Each point represents a decile bin, with the red regression line and gray-shaded area denoting the fitted linear trend and the $95\%$ confidence interval, respectively. Higher perplexity values are associated with lower BI.
  • Figure 2: Illustration of our proposed perplexity-aware GRPO. Here, we take the correct group within a batch (i.e., the set of samples whose ground-truth label is correct) as an example; the wrong group follows the same process.
  • Figure 3: Exploration comparison between vanilla and perplexity-aware GRPO. Closer curves across perplexity bins imply more balanced exploration.
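
The decile-bin presentation described in the Figure 1 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' plotting code: samples are sorted by perplexity, split into ten near-equal bins, averaged per bin, and fitted with an ordinary least-squares line. The helper names `decile_bins` and `fit_line` and the synthetic data are hypothetical. BI is treated simply as a per-sample score because its definition is not given in this summary, and the 95% confidence band is omitted.

```python
def decile_bins(xs, ys, n_bins=10):
    """Sort (x, y) pairs by x, split into n_bins near-equal chunks,
    and return one (mean_x, mean_y) point per non-empty chunk."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    points = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:
            mean_x = sum(x for x, _ in chunk) / len(chunk)
            mean_y = sum(y for _, y in chunk) / len(chunk)
            points.append((mean_x, mean_y))
    return points


def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx


# Synthetic data with a built-in negative trend (not the paper's data).
perplexities = [1.0 + 0.1 * i for i in range(100)]
scores = [0.9 - 0.05 * p for p in perplexities]   # stand-in for per-sample BI

points = decile_bins(perplexities, scores)
slope, intercept = fit_line(points)
print(f"{len(points)} bins, slope = {slope:.3f}, intercept = {intercept:.3f}")
```

A negative fitted slope corresponds to the trend the caption reports, where higher perplexity is associated with lower BI.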