Probability-Consistent Preference Optimization for Enhanced LLM Reasoning

Yunqiao Yang, Houxing Ren, Zimu Lu, Ke Wang, Weikang Shi, Aojun Zhou, Junting Pan, Mingjie Zhan, Hongsheng Li

TL;DR

Probability-Consistent Preference Optimization (PCPO) strengthens preference data for LLMs by adding token-level probability consistency to the traditional outcome-driven criterion. The method generates multiple responses per prompt, filters candidate pairs using Levenshtein distance, and computes a token-probability consistency score from which it derives a weighted score for each pair; this score is used to select high-quality preference pairs. The PCPO loss then jointly optimizes a weighted DPO term and a weighted NLL term, prioritizing pairs with strong internal coherence. It yields improved results on GSM8K, MATH-500, OlympiadBench, and AMC23 across multiple seed models, and remains adaptable to other DPO variants. The work shows that modeling the internal logical structure of responses, not just the final answer, enhances reasoning capabilities and alignment. Its limitations are that it relies on gold-standard answers and incurs additional computation for candidate-pair generation. Overall, PCPO offers a principled and effective pathway to richer preference data and more robust mathematical reasoning in LLMs.
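
To make the Levenshtein filtering step concrete, here is a minimal sketch of how edit distances between correct and incorrect responses might be computed and ranked. The whitespace tokenization, the function names `levenshtein` and `rank_candidate_pairs`, and the ascending sort order are all assumptions made for illustration; the summary above does not state how the paper tokenizes responses or which distances it keeps.

```python
from itertools import product


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def rank_candidate_pairs(correct, incorrect):
    """Hypothetical helper: score every (correct, incorrect) pair by edit distance.

    Whitespace splitting stands in for the model tokenizer (an assumption),
    and pairs are returned in ascending distance order (also an assumption).
    """
    scored = [
        (levenshtein(c.split(), r.split()), c, r)
        for c, r in product(correct, incorrect)
    ]
    return sorted(scored, key=lambda item: item[0])


ranked = rank_candidate_pairs(
    ["x = 2 + 3 = 5"],
    ["x = 2 + 3 = 6", "the answer is 7"],
)
for distance, chosen, rejected in ranked:
    print(distance, "|", chosen, "|", rejected)
```

A small edit distance means the two responses differ in only a few tokens, so the remaining differences are easy to localize; whether PCPO keeps the closest pairs or applies a different threshold is not specified in this summary.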

Abstract

Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches leverage high-quality pairwise preference data through outcome-based criteria like answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that our PCPO consistently outperforms existing outcome-only criterion approaches across a diverse range of LLMs and benchmarks. Our code is publicly available at https://github.com/YunqiaoYang/PCPO.

Paper Structure

This paper contains 33 sections, 6 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of the PCPO method. The pipeline consists of three main steps. (1) Given a prompt set, use $M_t$ (with $M_0$ as the seed model) to generate responses $y_i^n$, each with reasoning $c_i^n$ and answer $a_i^n$, and construct candidate pairs based on answer correctness. (2) Use $M_t$ to calculate a weighted score $s_{w}$ for each pair based on token probability consistency, and select preference pairs according to that score. (3) Train the next-iteration model $M_{t+1}$ on the selected preference pairs with the PCPO loss.
  • Figure 2: Rewards of PCPO and DPO. Comparison of the rewards for chosen and rejected responses when PCPO and DPO are trained on the same preference pairs.
  • Figure 3: Several correct and incorrect responses to the same prompt. The four responses can be divided into two groups, each sharing a similar response pattern.
  • Figure 4: Frequency Distribution and Cumulative Percentage Pareto Chart of Edit Distance Rankings.
  • Figure 5: The Match Function pipeline. For a given pair of chosen and rejected responses, we first use the current iteration model $M_t$ to tokenize them and then apply the Match Function algorithm to obtain the longest common token subsequences, highlighted in different colors.
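
The matching step described in Figure 5 can be illustrated with a standard longest-common-subsequence (LCS) routine over token lists. This is a sketch under stated assumptions, not the paper's Match Function algorithm: the function name `match_tokens` is hypothetical, whitespace splitting stands in for the tokenizer of $M_t$, and only one LCS alignment is returned even when several exist.

```python
def match_tokens(chosen, rejected):
    """Return index pairs (i, j) with chosen[i] == rejected[j] forming one LCS.

    Classic dynamic programming: dp[i][j] holds the LCS length of the
    suffixes chosen[i:] and rejected[j:].
    """
    n, m = len(chosen), len(rejected)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if chosen[i] == rejected[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk forward through the table to recover one optimal alignment.
    matches = []
    i = j = 0
    while i < n and j < m:
        if chosen[i] == rejected[j]:
            matches.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1  # skipping chosen[i] loses nothing
        else:
            j += 1  # skipping rejected[j] loses nothing
    return matches


# Whitespace tokens are an illustrative stand-in for model tokens.
chosen_tokens = "so x = 2 + 3 = 5".split()
rejected_tokens = "so x = 2 * 3 = 6".split()
aligned = match_tokens(chosen_tokens, rejected_tokens)
print([chosen_tokens[i] for i, _ in aligned])
```

The returned index pairs mark the positions where the two responses share tokens, which is where per-token probabilities from $M_t$ could be compared; how PCPO turns those comparisons into the weighted score $s_{w}$ is not given in this summary.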