Probability-Consistent Preference Optimization for Enhanced LLM Reasoning
Yunqiao Yang, Houxing Ren, Zimu Lu, Ke Wang, Weikang Shi, Aojun Zhou, Junting Pan, Mingjie Zhan, Hongsheng Li
TL;DR
Probability-Consistent Preference Optimization (PCPO) improves the selection of preference data for LLMs by adding token-level probability consistency to the traditional outcome-driven criterion. The method generates multiple responses per prompt, filters candidate pairs using Levenshtein distance, and computes a token-probability consistency score, from which it derives a pair-weighted score used to select high-quality preference pairs. The PCPO loss then jointly optimizes a weighted DPO term and a weighted NLL term, prioritizing pairs with strong internal coherence. It yields improved results on GSM8K, MATH-500, OlympiadBench, and AMC23 across multiple seed models and remains adaptable to other DPO variants. The work shows that modeling the internal logical structure of responses, not just the final answer, enhances reasoning capabilities and alignment. Its limitations are that it relies on gold-standard answers and incurs additional computation for candidate-pair generation. Overall, PCPO offers a principled path to richer preference data and more robust mathematical reasoning in LLMs.
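The Levenshtein-distance filtering step mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The whitespace tokenization, the choice to keep the closest pairs, and the names `candidate_pairs` and `top_k` are assumptions made for illustration, since the summary does not state the direction or threshold of the filter.

```python
from itertools import product


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def candidate_pairs(correct, incorrect, top_k=2):
    """Pair every correct response with every incorrect one and keep the
    top_k pairs with the smallest token-level edit distance.

    Assumption: closer pairs are preferred; the source does not confirm this.
    """
    scored = [
        (levenshtein(c.split(), r.split()), c, r)
        for c, r in product(correct, incorrect)
    ]
    scored.sort(key=lambda item: item[0])
    return scored[:top_k]
```

For example, `candidate_pairs(["2 + 3 = 5 so x = 5"], ["2 + 3 = 6 so x = 6", "the answer is 7"], top_k=1)` keeps the pair whose responses differ in only two tokens. In a full pipeline, the surviving pairs would then be scored for token-probability consistency, a step this sketch does not cover.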
Abstract
Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches select high-quality pairwise preference data through outcome-based criteria such as answer correctness or consistency, they neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that PCPO consistently outperforms existing approaches based on outcome-only criteria across a diverse range of LLMs and benchmarks. Our code is publicly available at https://github.com/YunqiaoYang/PCPO.
