Chasing Consistency: Quantifying and Optimizing Human-Model Alignment in Chain-of-Thought Reasoning

Boxuan Wang, Zhuoyun Li, Xinmiao Huang, Xiaowei Huang, Yi Dong

TL;DR

Chasing Consistency introduces the Alignment Score, a semantic-entropy-based metric that quantifies how closely a model's Chain-of-Thought reasoning aligns with human reference chains. It formalizes this alignment via semantic entropy matrices and Semantic Structural Divergence (SSD), then decomposes misalignment into four actionable error types: Logical Disconnection, Thematic Shift (TS), Redundant Reasoning (RR), and Causal Reversal. Empirically, 2-hop reasoning yields the highest Alignment Score, and Redundant Reasoning is identified as the strongest predictor of degraded alignment. The authors propose Semantic Consistency Optimization Sampling (SCOS), which selects semantically consistent chains by minimizing TS or RR. SCOS achieves an average $29.84\%$ improvement in the Alignment Score $S$, with smaller models benefiting most, and demonstrates a practical path to more trustworthy, coherent multi-step reasoning in LLMs.

Abstract

This paper presents a framework for evaluating and optimizing reasoning consistency in Large Language Models (LLMs) via a new metric, the Alignment Score, which quantifies the semantic alignment between model-generated reasoning chains and human-written reference chains in Chain-of-Thought (CoT) reasoning. Empirically, we find that 2-hop reasoning chains achieve the highest Alignment Score. To explain this phenomenon, we define four key error types: logical disconnection, thematic shift, redundant reasoning, and causal reversal, and show how each contributes to the degradation of the Alignment Score. Building on this analysis, we further propose Semantic Consistency Optimization Sampling (SCOS), a method that samples and favors chains with minimal alignment errors, significantly improving Alignment Scores by an average of 29.84% with longer reasoning chains, such as in 3-hop tasks.
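The selection idea behind SCOS, as the abstract describes it (sample several chains, favor the one with the fewest alignment errors), can be sketched as a best-of-n pick. This is a minimal illustration, not the authors' implementation. The names `select_most_consistent` and `toy_redundancy` are hypothetical, and the toy score is only a string-level stand-in for the paper's semantic-entropy-based TS and RR measures.

```python
from typing import Callable, Sequence


def select_most_consistent(
    chains: Sequence[str],
    error_score: Callable[[str], float],
) -> str:
    """Return the sampled chain with the lowest error score.

    `error_score` is a placeholder for an alignment-error measure such as
    TS or RR; ties go to the earliest sampled chain.
    """
    if not chains:
        raise ValueError("need at least one sampled chain")
    return min(chains, key=error_score)


def toy_redundancy(chain: str) -> float:
    """Assumed stand-in for RR: fraction of steps repeating an earlier step.

    Steps are split on periods and compared as lowercase strings, which is
    far cruder than a semantic comparison.
    """
    steps = [s.strip().lower() for s in chain.split(".") if s.strip()]
    if not steps:
        return 0.0
    return 1.0 - len(set(steps)) / len(steps)


sampled = [
    "A is in B. A is in B. B is in C.",  # repeats its first step
    "A is in B. B is in C.",             # no repeated step
]
best = select_most_consistent(sampled, toy_redundancy)
```

Here `best` is the second chain, since it has no repeated step. Swapping in a different `error_score` changes the selection criterion without touching the sampling loop.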

Paper Structure

This paper contains 18 sections, 10 equations, 3 figures, 5 tables, 2 algorithms.

Figures (3)

  • Figure 1: Comparison of Reasoning Imitation between a Strong and a Weak Model under CoT Prompting: The Alignment Score captures the discrepancy (the better model yields a higher Alignment Score).
  • Figure 2: Illustration of the Alignment Score Computation: (1) Construct the prompt with a reference reasoning chain and generate model reasoning chains (red box; see Appendix A for prompt templates); (2) Compute pairwise semantic entropies using an NLI model (blue box); (3) Compare the resulting semantic entropies and derive the final alignment score (green box).
  • Figure 3: Examples of Four Types of Alignment Errors