Chasing Consistency: Quantifying and Optimizing Human-Model Alignment in Chain-of-Thought Reasoning

Boxuan Wang; Zhuoyun Li; Xinmiao Huang; Xiaowei Huang; Yi Dong

TL;DR

Chasing Consistency introduces the Alignment Score, a semantic-entropy-based metric that quantifies how closely a model's Chain-of-Thought (CoT) reasoning aligns with human reference chains. It formalizes this alignment via semantic entropy matrices and Semantic Structural Divergence (SSD), then decomposes misalignment into four actionable error types: Logical Disconnection, Thematic Shift (TS), Redundant Reasoning (RR), and Causal Reversal. Empirically, 2-hop reasoning yields the highest Alignment Score, and Redundant Reasoning is identified as the strongest predictor of degraded alignment. The authors propose Semantic Consistency Optimization Sampling (SCOS), which selects semantically consistent chains by minimizing TS or RR and achieves an average $29.84\%$ improvement in Alignment Score ($S$), particularly for smaller models. The result is a practical path toward more trustworthy, coherent multi-step reasoning in LLMs.

Abstract

This paper presents a framework for evaluating and optimizing reasoning consistency in Large Language Models (LLMs) via a new metric, the Alignment Score, which quantifies the semantic alignment between model-generated reasoning chains and human-written reference chains in Chain-of-Thought (CoT) reasoning. Empirically, we find that 2-hop reasoning chains achieve the highest Alignment Score. To explain this phenomenon, we define four key error types (logical disconnection, thematic shift, redundant reasoning, and causal reversal) and show how each contributes to the degradation of the Alignment Score. Building on this analysis, we further propose Semantic Consistency Optimization Sampling (SCOS), a method that samples multiple chains and favors those with minimal alignment errors, improving Alignment Scores by an average of 29.84% on longer reasoning chains, such as 3-hop tasks.
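The selection step behind SCOS, as described above, can be illustrated with a minimal sketch: sample several candidate chains, score each for an alignment error, and keep the lowest-scoring one. Everything named below is an assumption made for illustration. `select_chain` and `token_overlap_redundancy` are hypothetical helpers, and the word-overlap score is only a crude stand-in for Redundant Reasoning; it is not the semantic-entropy-based definition used in the paper.

```python
from typing import Callable, List, Sequence

Chain = List[str]  # one reasoning chain = ordered list of step strings


def token_overlap_redundancy(chain: Sequence[str]) -> float:
    """Hypothetical stand-in for a Redundant Reasoning (RR) score.

    Averages the Jaccard overlap of word sets over all pairs of steps.
    Assumption: this is NOT the paper's semantic-entropy-based measure.
    """
    token_sets = [set(step.lower().split()) for step in chain]
    pairs = [
        (a, b)
        for i, a in enumerate(token_sets)
        for b in token_sets[i + 1:]
    ]
    if not pairs:
        return 0.0  # a chain with fewer than two steps has no pairwise overlap
    overlaps = [len(a & b) / len(a | b) if (a | b) else 0.0 for a, b in pairs]
    return sum(overlaps) / len(overlaps)


def select_chain(
    candidates: Sequence[Chain],
    error_score: Callable[[Chain], float],
) -> Chain:
    """Return the sampled chain with the smallest error score."""
    if not candidates:
        raise ValueError("at least one candidate chain is required")
    return min(candidates, key=error_score)


if __name__ == "__main__":
    sampled = [
        ["Paris is in France.", "Paris is in France.", "So the answer is France."],
        ["The Louvre is in Paris.", "Paris is the capital of France.", "So the answer is France."],
    ]
    print(select_chain(sampled, token_overlap_redundancy))
```

Passing the scorer as a parameter keeps the selection logic independent of how the error is measured, so a Thematic Shift score could be swapped in without changing `select_chain`.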
