Aligning with Logic: Measuring, Evaluating and Improving Logical Preference Consistency in Large Language Models

Yinhong Liu; Zhijiang Guo; Tianya Liang; Ehsan Shareghi; Ivan Vulić; Nigel Collier

TL;DR

Problem: LLMs exhibit inconsistencies in reasoning and decision-making. Approach: formalize logical preference consistency via transitivity, commutativity, and negation invariance, and introduce REPAIR, a data refinement and augmentation technique. Contributions: a universal measurement framework, an extensive cross-model evaluation, and evidence that improving consistency enhances logic-based downstream performance while preserving human alignment. Impact: supports deploying more reliable, logically coherent AI systems in high-stakes contexts.

Abstract

Large Language Models (LLMs) are expected to be predictable and trustworthy to support reliable decision-making systems. Yet current LLMs often show inconsistencies in their judgments. In this work, we examine logical preference consistency as a foundational requirement for building more dependable LLM systems, ensuring stable and coherent decision-making while minimizing erratic or contradictory outputs. To quantify logical preference consistency, we propose a universal evaluation framework based on three fundamental properties: transitivity, commutativity, and negation invariance. Through extensive experimentation across diverse LLMs, we demonstrate that these properties serve as strong indicators of judgment robustness. Furthermore, we introduce a data refinement and augmentation technique, REPAIR, that enhances logical consistency while maintaining alignment with human preferences. Finally, we show that improving consistency leads to better performance in LLM-driven logic-based algorithms, reinforcing stability and coherence in decision-making systems.
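To make two of the three properties concrete, here is a minimal Python sketch of how commutativity and negation invariance could be scored for a pairwise judge. Everything in it is an assumption made for illustration: `biased_judge` is a toy stand-in for an LLM call, `SCORES` holds invented quality values, and the two scoring functions use simple agreement rates, which may differ from the exact metrics defined in the paper.

```python
from itertools import combinations

# Hypothetical quality scores for four items (illustrative only).
SCORES = {"s1": 0.9, "s2": 0.6, "s3": 0.55, "s4": 0.1}


def biased_judge(first, second, negated=False):
    """Toy stand-in for an LLM judge; returns the item it selects.

    With negated=True the judge is asked for the *worse* item.
    Near ties are resolved in favour of the first-listed item,
    which mimics a position bias.
    """
    gap = SCORES[first] - SCORES[second]
    if abs(gap) < 0.1:
        return first
    better, worse = (first, second) if gap > 0 else (second, first)
    return worse if negated else better


def commutativity(judge, items):
    """Share of pairs whose selected item is unchanged when the order is swapped."""
    pairs = list(combinations(items, 2))
    agree = sum(judge(a, b) == judge(b, a) for a, b in pairs)
    return agree / len(pairs)


def negation_invariance(judge, items):
    """Share of pairs where the 'worse' answer is the item NOT picked as 'better'."""
    pairs = list(combinations(items, 2))
    agree = sum(judge(a, b) != judge(a, b, negated=True) for a, b in pairs)
    return agree / len(pairs)


items = sorted(SCORES)
print(commutativity(biased_judge, items))
print(negation_invariance(biased_judge, items))
```

In this toy setup only the near-tied pair (`s2`, `s3`) breaks either property, so both rates come out at five of six pairs.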

Paper Structure

This paper contains 29 sections, 4 equations, 9 figures, 7 tables, 2 algorithms.

Introduction
Measuring Logical Consistency
Measuring Transitivity
Measuring Commutativity
Measuring Negation Invariance
Evaluating Logical Consistency of LLMs
Evaluation Setup
Results and Analysis
Consistency and Reliability
Improve Logical Preference Consistency in LLMs via REPAIR
Estimating Rankings from Noisy Pairwise Data
Experiments
Impact of Logical Preference Consistency on Downstream Applications
Related Work
Conclusion
...and 14 more sections

Figures (9)

Figure 1: Three types of logical inconsistencies are observed in real-world pairwise annotations (top row): Transitivity, Commutativity, and Negation Invariance. By refining the data for self-consistency using rank estimation, we can train LLMs with enhanced logical consistency, improving their performance in logic-dependent algorithms (bottom row).
Figure 2: Example of relation graphs illustrating transitivity, where items are represented as nodes, and directed edges indicate pairwise preferential relations. Red dashed cycles in the graph highlight violations of transitivity. The cycle in (d), spanning 4 items, cannot be captured by $s_{tran}(3)$. The $s_{tran}$ metric can be applied to partial relation graphs, as shown in (c) and (d).
Figure 3: Examples illustrating violations of commutativity and negation invariance. Each entry of the two preference matrices represents predicted judgments of $x_i\succ x_j$ and $x_i \prec x_j$, labelled with A and B respectively. The top-left matrix is based on the original relation, while the bottom-right matrix reflects the negated relation. Linked red cycles highlight non-commutative pairs, and linked dashed purple cycles indicate negation inconsistencies.
Figure 4: Transitivity shows strong correlations with self-agreement across all three datasets. Self-agreement is measured as the percentage of majority choices from 10 CoT inferences, generated with a temperature of 0.7.
Figure 5: Commutativity shows a generally strong correlation with human preference across various LLMs and datasets.
...and 4 more figures
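The point made in the Figure 2 caption, that a cycle spanning 4 items is invisible to $s_{tran}(3)$, can be checked with a short Python sketch. It assumes that $s_{tran}(K)$ is the fraction of $K$-item sub-graphs containing no directed cycle, and it enumerates every sub-graph exhaustively; the paper's own definition and sampling procedure are not given in this section and may differ. The names `has_cycle` and `s_tran`, and the four-item example graph, are illustrative.

```python
from itertools import combinations


def has_cycle(nodes, edges):
    """Depth-first cycle detection on the sub-graph induced by `nodes`.

    `edges` is a set of (u, v) tuples meaning "u is preferred over v".
    Missing edges are allowed, so partial relation graphs work too.
    """
    nodes = set(nodes)
    adj = {n: [v for (u, v) in edges if u == n and v in nodes] for n in nodes}
    state = {n: 0 for n in nodes}  # 0 = unvisited, 1 = on stack, 2 = finished

    def visit(n):
        state[n] = 1
        for m in adj[n]:
            if state[m] == 1 or (state[m] == 0 and visit(m)):
                return True
        state[n] = 2
        return False

    return any(state[n] == 0 and visit(n) for n in nodes)


def s_tran(items, edges, k):
    """Fraction of k-item sub-graphs that are acyclic (assumed definition)."""
    subsets = list(combinations(items, k))
    acyclic = sum(not has_cycle(subset, edges) for subset in subsets)
    return acyclic / len(subsets)


# Partial relation graph with a single 4-item cycle: a > b > c > d > a.
items = ["a", "b", "c", "d"]
edges = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")}

print(s_tran(items, edges, 3))  # every 3-item sub-graph is acyclic
print(s_tran(items, edges, 4))  # the full graph contains the cycle
```

Because no 3-item subset of this graph closes a loop, the sketch reports perfect transitivity at $K=3$ and none at $K=4$, which is why larger sub-graph sizes are needed to expose longer cycles.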
