Probability-Consistent Preference Optimization for Enhanced LLM Reasoning

Yunqiao Yang; Houxing Ren; Zimu Lu; Ke Wang; Weikang Shi; Aojun Zhou; Junting Pan; Mingjie Zhan; Hongsheng Li

TL;DR

Probability-Consistent Preference Optimization (PCPO) improves preference data for LLM training by adding token-level probability consistency to the usual outcome-driven criterion. The method generates multiple responses per prompt, filters candidate pairs using Levenshtein distance, and computes a token-probability consistency score, from which it derives a pair-weighted score for selecting high-quality preference pairs. The PCPO loss then jointly optimizes a weighted DPO term and a weighted NLL term, prioritizing pairs with strong internal coherence. It yields improved results on GSM8K, MATH-500, OlympiadBench, and AMC23 across multiple seed models and remains adaptable to other DPO variants. The work shows that modeling the internal logical structure of responses, not just the final answer, enhances reasoning capabilities and alignment. Its limitations are a reliance on gold-standard answers and the additional computation needed for candidate-pair generation. Overall, PCPO offers a principled way to build richer preference data and more robust mathematical reasoning in LLMs.
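The pipeline summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation; the released code at the repository linked in the abstract is the reference. Several things here are assumptions made for illustration: the function names (`levenshtein`, `closest_pair`, `weighted_dpo_nll`) are hypothetical, the filter is assumed to prefer the correct/incorrect pair with the smallest edit distance, edit distance is computed over whitespace-separated tokens, and the pair weight is taken as a given number because the summary does not define the token-probability consistency score. The loss uses the standard DPO form with a length-normalized NLL term, both scaled by that weight.

```python
import math
from itertools import product


def levenshtein(a, b):
    """Edit distance between two sequences (classic two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,           # delete x
                            curr[j - 1] + 1,       # insert y
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def closest_pair(correct, incorrect):
    """Assumed filter: return the (chosen, rejected) pair with the smallest
    token-level edit distance. The selection rule is an assumption."""
    return min(product(correct, incorrect),
               key=lambda p: levenshtein(p[0].split(), p[1].split()))


def weighted_dpo_nll(logp_c, logp_r, ref_c, ref_r, n_tokens_c,
                     weight, beta=0.1, alpha=1.0):
    """Pair-weighted DPO + NLL loss for a single preference pair.

    logp_* / ref_* are summed log-probabilities of the chosen (c) and
    rejected (r) responses under the policy and the reference model.
    `weight` stands in for the pair-weighted score (taken as given here);
    `beta` and `alpha` are hypothetical hyperparameters.
    """
    margin = beta * ((logp_c - ref_c) - (logp_r - ref_r))
    dpo = math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))
    nll = -logp_c / n_tokens_c            # length-normalized NLL of chosen
    return weight * (dpo + alpha * nll)
```

As a quick check of the distance function, `levenshtein("kitten", "sitting")` evaluates to 3. In this sketch a larger `weight` simply scales up a pair's contribution to the total loss, which is one straightforward way to prioritize pairs judged more internally coherent.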

Abstract

Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches leverage high-quality pairwise preference data through outcome-based criteria like answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that our PCPO consistently outperforms existing outcome-only criterion approaches across a diverse range of LLMs and benchmarks. Our code is publicly available at https://github.com/YunqiaoYang/PCPO.
