How Likely Do LLMs with CoT Mimic Human Reasoning?

Guangsheng Bao; Hongbo Zhang; Cunxiang Wang; Linyi Yang; Yue Zhang

TL;DR

The paper investigates whether LLMs with Chain-of-Thought (CoT) mimic human reasoning by modeling the task as a causal system among $Z$ (instruction), $X$ (CoT), and $Y$ (answer). Using interventions to identify structural causal models (SCMs) and testing across six reasoning tasks and multiple models, it finds that many task–model pairs exhibit common-cause or full-connection structures, which can cause consistency and faithfulness issues. The study reveals that in-context learning strengthens the causal chain while supervised fine-tuning and RLHF weaken it, and that simply increasing model size does not guarantee improved causal structure or human-like reasoning. These findings emphasize the need for new techniques beyond scaling to align LLM reasoning with human cognition and enhance CoT fidelity and reliability.

Abstract

Chain-of-thought (CoT) has emerged as a promising technique for eliciting reasoning capabilities from Large Language Models (LLMs). However, it does not always improve task performance or accurately represent the reasoning process, leaving open questions about its usage. In this paper, we diagnose the underlying mechanism by comparing the reasoning process of LLMs with that of humans, using causal analysis to understand the relationships between the problem instruction, the reasoning, and the answer in LLMs. Our empirical study reveals that LLMs often deviate from the ideal causal chain, resulting in spurious correlations and potential consistency errors (inconsistent reasoning and answers). We also examine various factors influencing the causal structure, finding that in-context learning with examples strengthens it, while post-training techniques such as supervised fine-tuning and reinforcement learning from human feedback weaken it. To our surprise, the causal structure cannot be strengthened by enlarging the model size alone, which calls for research on new techniques. We hope that this preliminary study will shed light on understanding and improving the reasoning process in LLMs.

Paper Structure (44 sections, 6 equations, 3 figures, 18 tables)

Introduction
Related Work
LLM Reasoning.
Chain-of-Thought Faithfulness.
Causal Reasoning in LLMs.
Causal Analysis
Random Variables
Identification of SCM
SCM Types
Experiments
Experimental Settings
Models.
Datasets.
Evaluation of Consistency Error.
Causal Structures in LLM Tasks
...and 29 more sections

Figures (3)

Figure 1: Causal analysis, where we identify an SCM from an LLM-task pair using treatment experiments. For each pair of variables with possible causal relation, we conduct an experiment by injecting an intervention into the treated variable and observe its effect.
Figure 2: Four types of SCM, where the structure of an SCM reveals its latent behavior, providing explanations on when and why problems may occur during the reasoning process.
Figure 3: Three examples of CoT mistakes, where either the CoT is incorrect but the answer is correct, or the other way around. The red highlights the incorrect steps, with explanations provided at the end of each.
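
The treatment experiments described in the Figure 1 caption and the structure types in Figure 2 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed `threshold` used in place of a proper significance test, the toy 0/1 correctness data, and the mapping from the two effects to a structure label are all assumptions made for illustration. The page names only the causal chain, common-cause, and full-connection structures, so the label for the fourth case is a placeholder.

```python
from statistics import mean


def average_treatment_effect(treated, control):
    """Difference in mean outcome between treated and control runs."""
    return mean(treated) - mean(control)


def classify_structure(cot_effect, instruction_effect, threshold=0.1):
    """Map two intervention effects on the answer Y to a structure label.

    cot_effect: effect on Y of intervening on the CoT X.
    instruction_effect: effect on Y of intervening on the instruction Z
        while the CoT is held fixed.
    threshold: hypothetical cutoff standing in for a significance test.
    """
    x_causes_y = abs(cot_effect) > threshold
    z_causes_y = abs(instruction_effect) > threshold
    if x_causes_y and not z_causes_y:
        return "causal chain"      # Z -> X -> Y
    if z_causes_y and not x_causes_y:
        return "common cause"      # Z -> X and Z -> Y
    if x_causes_y and z_causes_y:
        return "full connection"   # Z -> X, Z -> Y and X -> Y
    return "no detectable effect"  # placeholder label for the fourth type


# Toy data (made up): 1 = correct answer, 0 = incorrect answer.
control = [1, 1, 1, 0, 1, 1, 0, 1]
after_cot_intervention = [0, 0, 1, 0, 0, 1, 0, 0]
after_instruction_intervention = [1, 1, 1, 0, 1, 1, 0, 1]

cot_effect = average_treatment_effect(after_cot_intervention, control)
instruction_effect = average_treatment_effect(
    after_instruction_intervention, control
)
label = classify_structure(cot_effect, instruction_effect)
```

In this toy case, corrupting the CoT lowers accuracy while perturbing the instruction with the CoT fixed changes nothing, so the sketch reports a causal chain, the structure the abstract treats as ideal.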

Theorems & Definitions (4)

Definition 3.1: Cause-Effect Interventions
Definition 3.2: Average Treatment Effect
Definition B.1: Structural Causal Model
Definition B.2: Confounder of Variables
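
For reference, the average treatment effect named in Definition 3.2 is conventionally written as below. This is the standard textbook form using Pearl's $do$ operator, with $x_1$ and $x_0$ denoting the treated and control values of the intervened variable; the paper's exact notation is not shown on this page and may differ.

```latex
\mathrm{ATE} = \mathbb{E}\left[\, Y \mid do(X = x_1) \,\right] - \mathbb{E}\left[\, Y \mid do(X = x_0) \,\right]
```

An effect close to zero suggests that the treated variable has no causal influence on $Y$ under that intervention.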
