When the Chain Breaks: Interactive Diagnosis of LLM Chain-of-Thought Reasoning Errors

Shiwei Chen, Niruthikka Sritharan, Xiaolin Wen, Chenxi Zhang, Xingbo Wang, Yong Wang

Abstract

Current Large Language Models (LLMs), especially Large Reasoning Models, can generate Chain-of-Thought (CoT) reasoning traces to illustrate how they produce final outputs, thereby facilitating trust calibration for users. However, these CoT reasoning traces are usually lengthy and tedious, and can contain various issues, such as logical and factual errors, which make it difficult for users to interpret the reasoning traces efficiently and accurately. To address these challenges, we develop an error detection pipeline that combines external fact-checking with symbolic formal logical validation to identify errors at the step level. Building on this pipeline, we propose ReasonDiag, an interactive visualization system for diagnosing CoT reasoning traces. ReasonDiag provides 1) an integrated arc diagram to show reasoning-step distributions and error-propagation patterns, and 2) a hierarchical node-link diagram to visualize high-level reasoning flows and premise dependencies. We evaluate ReasonDiag through a technical evaluation for the error detection pipeline, two case studies, and user interviews with 16 participants. The results indicate that ReasonDiag helps users effectively understand CoT reasoning traces, identify erroneous steps, and determine their root causes.
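The step-level logical check described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: it uses plain Python evaluation as a stand-in for symbolic validation, and the `Step` schema, its field names, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Step:
    """One reasoning step (hypothetical schema, not taken from the paper)."""
    step_id: int
    premises: Dict[str, int]                  # named values the step relies on
    derive: Callable[[Dict[str, int]], int]   # what the premises actually entail
    claimed: int                              # what the CoT text asserts


def check_step(step: Step) -> bool:
    """Return True when the claimed value agrees with what the premises entail."""
    return step.derive(step.premises) == step.claimed


# Two illustrative steps computing the same difference; the second miscalculates.
steps = [
    Step(1, {"start": 1990, "end": 2025}, lambda p: p["end"] - p["start"], 35),
    Step(2, {"start": 1990, "end": 2025}, lambda p: p["end"] - p["start"], 25),
]

flagged = [s.step_id for s in steps if not check_step(s)]
print(flagged)  # [2]
```

A real symbolic check would encode premises and conclusion as solver constraints and test them for consistency rather than evaluating one fixed expression; the sketch only shows the idea of validating each step against its own premises.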

Paper Structure (32 sections, 7 figures, 4 tables)

Figures (7)

  • Figure 1: The Error Detection Pipeline comprises three stages: (1) Premise Tree Generation structures the raw CoT by classifying step roles and mapping dependencies; (2) Factual Error Detection verifies checkable claims using retrieval-augmented external evidence; and (3) Logical Error Detection translates natural language steps into symbolic constraints for formal consistency checks via the Z3 solver. The CoT reasoning example above is prompted by the question: "How many years have passed between the launch of the Hubble Space Telescope and the year 2025?"
  • Figure 1: Example of Claude presenting a summarized reasoning trace with multiple steps.
  • Figure 2: ReasonDiag interface: (A) The Overview displays ordinal reasoning steps (A1) along a horizontal axis (A4) and highlights uncertain regions (A7) and error propagation (A6). Users can adjust the shown steps (A5) using two filters (A2, A3). (B) The Section View presents a hierarchical summary through textual section labels (B1) and colored step markers (B2), allowing users to click on erroneous steps to reveal their premise–conclusion relationships (B3) and the associated diagnostic evidence (B4, B5). (C) The Original CoT provides the full textual CoT, organized either by individual reasoning steps or by sections for contextual inspection (C1, C2).
  • Figure 2: Example of ChatGPT presenting a summarized reasoning trace with brief explanations.
  • Figure 3: With ReasonDiag, a user diagnoses errors and reasoning patterns in a mathematical CoT. (A) Problem statement. (B) Overview of step types and error propagation, with (B1) highlighting the "polluted" final answer and (B2) a retroactive reasoning pattern. (C) Structured reasoning trace, where (C1) shows the premise–conclusion chain to the answer and the red error path, and (C2) describes the self-correction phase. (D) Original CoT text, with (D1–D4) illustrating retroactive reasoning. (E) Erroneous step and its premises, with (E1) revealing a mistake where the thousands digit is not decremented.
  • ...and 2 more figures
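
The captions above refer to premise dependencies, error propagation, and a "polluted" final answer reached along an error path. A minimal sketch of how such propagation could be computed over a premise graph follows. The function names, the dictionary representation, and the assumption that each step's premises are earlier (lower-numbered) steps are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Set


def propagate_errors(premises: Dict[int, List[int]], erroneous: Set[int]) -> Set[int]:
    """Return steps that are erroneous or depend, directly or transitively, on one.

    Assumes every premise of a step has a lower step number than the step itself,
    so a single pass in ascending order is enough.
    """
    polluted: Set[int] = set()
    for step in sorted(premises):
        if step in erroneous or any(p in polluted for p in premises[step]):
            polluted.add(step)
    return polluted


def error_path(premises: Dict[int, List[int]], polluted: Set[int], step: int) -> List[int]:
    """Walk back from a polluted step along polluted premises to a root cause."""
    path: List[int] = []
    while step in polluted:
        path.append(step)
        bad = [p for p in premises[step] if p in polluted]
        if not bad:
            break  # no polluted premise left: this step is where the error starts
        step = bad[0]
    return path


# Illustrative graph: step 5 is the final answer, step 2 contains the mistake.
premises = {1: [], 2: [1], 3: [1], 4: [2, 3], 5: [4]}
polluted = propagate_errors(premises, erroneous={2})

print(sorted(polluted))                    # [2, 4, 5]
print(error_path(premises, polluted, 5))   # [5, 4, 2]
```

Here the final answer (step 5) is polluted only through step 4, which in turn inherits the error from step 2, so the walk back ends at step 2 as the root cause.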