TRUE: A Trustworthy Unified Explanation Framework for Large Language Model Reasoning

Yujiao Yang

TL;DR

TRUE is a unified explanation framework for LLM reasoning that combines executable explanations, local feasible-region modeling via structure-preserving perturbations, and cluster-level failure-mode analysis. It defines an executable explanation $E=(e_1,\dots,e_T)$ and a blind executor $V$ that verifies the reasoning by executing it, builds a local DAG $G=(S,E)$ that captures feasible reasoning trajectories, and uses Shapley values $\phi_i$ to quantify the influence of each failure mode at the class level. The approach is evaluated on GSM8K, MATH, MMLU, and BBH. The reported results show that executable explanations can recover correct answers, that local feasible regions provide structural coverage, and that failure modes identify systematic weaknesses with quantified influence. Together, these results support a principled, executable, and interpretable paradigm for diagnosing and improving the reliability of LLM reasoning.
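
To make the idea of an executable explanation and a blind executor concrete, the following is a minimal sketch rather than the paper's implementation. The `Step` schema, the `OPS` table, and the toy arithmetic problem are all assumptions introduced for illustration; the actual specification format is not given in this summary. The only property the sketch tries to capture is that the executor sees the steps $e_1,\dots,e_T$ and nothing else.

```python
import operator
from dataclasses import dataclass


# Hypothetical step schema (an assumption, not the paper's format).
@dataclass(frozen=True)
class Step:
    out: str      # name of the variable this step defines
    op: str       # operation name, looked up in OPS
    args: tuple   # operands: earlier variable names (str) or numeric literals


# Hypothetical operation table for a toy arithmetic domain.
OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": operator.truediv,
}


def blind_execute(steps):
    """Run the steps in order using only their own contents.

    The original problem text is never passed in, so a step that relies on
    unstated information fails with a KeyError instead of being silently
    repaired by the executor.
    """
    env = {}
    for step in steps:
        values = [env[a] if isinstance(a, str) else a for a in step.args]
        env[step.out] = OPS[step.op](*values)
    return env[steps[-1].out]


# Toy explanation: 3 bags of 4 apples, then 5 apples are eaten.
explanation = [
    Step("apples_total", "mul", (3, 4)),
    Step("apples_left", "sub", ("apples_total", 5)),
]
answer = blind_execute(explanation)
```

Comparing `answer` with the reference answer gives an instance-level check. Under this design, an explanation that skips a step or references an undefined quantity cannot be executed at all, which is the kind of operational validity the blind executor is meant to test.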

Abstract

Large language models (LLMs) have demonstrated strong capabilities in complex reasoning tasks, yet their decision-making processes remain difficult to interpret. Existing explanation methods often lack trustworthy structural insight and are limited to single-instance analysis, failing to reveal reasoning stability and systematic failure mechanisms. To address these limitations, we propose the Trustworthy Unified Explanation Framework (TRUE), which integrates executable reasoning verification, feasible-region directed acyclic graph (DAG) modeling, and causal failure mode analysis. At the instance level, we redefine reasoning traces as executable process specifications and introduce blind execution verification to assess operational validity. At the local structural level, we construct feasible-region DAGs via structure-consistent perturbations, enabling explicit characterization of reasoning stability and the executable region in the local input space. At the class level, we introduce a causal failure mode analysis method that identifies recurring structural failure patterns and quantifies their causal influence using Shapley values. Extensive experiments across multiple reasoning benchmarks demonstrate that the proposed framework provides multi-level, verifiable explanations, including executable reasoning structures for individual instances, feasible-region representations for neighboring inputs, and interpretable failure modes with quantified importance at the class level. These results establish a unified and principled paradigm for improving the interpretability and reliability of LLM reasoning systems.
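
The class-level attribution step can be illustrated with an exact Shapley computation over a small set of failure modes. This is a sketch under stated assumptions: the mode names, the `toy_error_rate` value function, and its numbers are invented for illustration and are not results from the paper. It uses the standard Shapley definition, in which each mode's value is its marginal contribution averaged over all orders in which modes could be added.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a number."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of coalitions of size k that exclude p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical failure modes and error rates (illustrative numbers only).
MODES = ["unit_slip", "dropped_constraint", "sign_error"]
BASE = {"unit_slip": 0.10, "dropped_constraint": 0.20, "sign_error": 0.05}


def toy_error_rate(active):
    """Error rate when the given failure modes are injected together."""
    rate = sum(BASE[m] for m in active)
    # Assumed interaction: these two modes compound when both are present.
    if {"unit_slip", "dropped_constraint"} <= active:
        rate += 0.10
    return rate


phi = shapley_values(MODES, toy_error_rate)
```

In this toy setting the interaction term is split evenly between the two interacting modes, and the values sum to the error rate with all modes active minus the rate with none. Exact enumeration costs on the order of $2^n$ evaluations, so a larger set of modes would need sampling-based approximation.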

Paper Structure

This paper contains 30 sections, 17 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Generation and verification of executable explanations. Given an input instance, the LLM interpreter produces a structured executable explanation consisting of explicit reasoning steps. The explanation is executed by a blind execution verifier to obtain a predicted answer, while white-box tools independently verify the correctness of each reasoning step.
  • Figure 2: Construction of the local feasible-region graph via structure-preserving perturbations. Verified perturbed samples are aggregated into a directed acyclic graph, where nodes represent reasoning steps and edges encode their dependencies.
  • Figure 3: Overview of the proposed failure mode discovery and attribution framework. Given a sample problem class, semantically similar instances are grouped via LLM-assisted clustering, followed by systematic failure mode discovery. Controlled perturbations and mixed-mode testing are then performed to evaluate reasoning behavior under failure combinations. Finally, Shapley value attribution quantifies the causal impact of each failure mode on prediction outcomes.
  • Figure 4: Cluster size scaling of failure mode stability. Mode overlap (Jaccard) and ranking stability (Kendall’s $\tau$) increase and saturate with cluster size, indicating robust failure mode discovery.
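
The two stability metrics named in the Figure 4 caption can be sketched as follows. How the paper pairs discovery runs is not stated in this summary, so the sketch assumes two ranked lists of failure modes, for example from clusters of different sizes; the mode names are hypothetical. Kendall's $\tau$ is computed here in its tau-a form over the modes shared by both lists, assuming no ties.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two collections of failure modes."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a over the items present in both rankings (no ties)."""
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    common = [m for m in rank_a if m in pos_b]
    pairs = list(combinations(common, 2))
    if not pairs:
        raise ValueError("need at least two shared items to compare rankings")
    # +1 for each pair ordered the same way in both rankings, -1 otherwise.
    score = sum(
        1 if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0 else -1
        for x, y in pairs
    )
    return score / len(pairs)


# Hypothetical ranked failure modes from two discovery runs.
run_small = ["sign_error", "unit_slip", "dropped_constraint"]
run_large = ["unit_slip", "sign_error", "dropped_constraint", "off_by_one"]

overlap = jaccard(run_small, run_large)
tau = kendall_tau(run_small, run_large)
```

Values of both metrics close to 1 would indicate that the same modes are found and ranked in the same order regardless of cluster size, which is the saturation behavior the caption describes.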