Neural Networks Decoded: Targeted and Robust Analysis of Neural Network Decisions via Causal Explanations and Reasoning

Alec F. Diallo; Vaishak Belle; Paul Patras

TL;DR

Tracer addresses the opacity of deep neural networks with a causal-explanation framework that requires neither retraining nor architectural changes. It formalizes the problem using Structural Causal Models and Pearl's Causal Hierarchy, and combines input interventions, CKA-based causal-node discovery, ACE estimation, and GAN-based counterfactual generation to reveal a network's internal decision dynamics. The Average Causal Effect (ACE) is defined as $\mathrm{ACE}_i = \mathbb{E}_{P(X)}\left[\,|\Delta_x^i| \cdot \mathrm{KL}\left(P(g_i'(x) \mid do(X=x')) \,\|\, P(g_i'(x) \mid do(X=x))\right)\right]$, which quantifies the causal impact of each layer group. Experiments on MNIST, ImageNet, and CIC-IDS show that Tracer yields coherent explanations, supports global explainability, and enables model compression with minimal loss in accuracy, which makes it relevant for trustworthy and efficient DNN deployment.
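To make the ACE formula concrete, the following is a minimal Monte Carlo sketch, not the authors' implementation. It assumes that the output of $g_i'$ under each intervention is available as a probability vector per sample, and that $\Delta_x^i$ is a precomputed scalar per sample; neither quantity is defined in this summary, so both readings are assumptions. The function names `kl_divergence` and `average_causal_effect` are hypothetical.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """Row-wise KL(p || q) for arrays of shape (n_samples, n_outcomes)."""
    # Clip to avoid log(0), then renormalise so each row sums to one.
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    p = p / p.sum(axis=-1, keepdims=True)
    q = q / q.sum(axis=-1, keepdims=True)
    return np.sum(p * np.log(p / q), axis=-1)


def average_causal_effect(p_intervened, p_original, delta):
    """Sample mean of |delta| * KL(p_intervened || p_original).

    p_intervened: assumed distribution of node i's output under do(X = x').
    p_original:   assumed distribution of node i's output under do(X = x).
    delta:        assumed per-sample scalar standing in for Delta_x^i.
    """
    kl = kl_divergence(p_intervened, p_original)
    return float(np.mean(np.abs(np.asarray(delta, dtype=float)) * kl))
```

Under these assumptions the estimate is zero whenever the intervention leaves a node's output distribution unchanged, and it grows with both the size of the change and the divergence between the two distributions.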

Abstract

Despite their success and widespread adoption, the opaque nature of deep neural networks (DNNs) continues to hinder trust, especially in critical applications. Current interpretability solutions often yield inconsistent or oversimplified explanations, or require model changes that compromise performance. In this work, we introduce TRACER, a novel method grounded in causal inference theory designed to estimate the causal dynamics underpinning DNN decisions without altering their architecture or compromising their performance. Our approach systematically intervenes on input features to observe how specific changes propagate through the network, affecting internal activations and final outputs. Based on this analysis, we determine the importance of individual features, and construct a high-level causal map by grouping functionally similar layers into cohesive causal nodes, providing a structured and interpretable view of how different parts of the network influence the decisions. TRACER further enhances explainability by generating counterfactuals that reveal possible model biases and offer contrastive explanations for misclassifications. Through comprehensive evaluations across diverse datasets, we demonstrate TRACER's effectiveness over existing methods and show its potential for creating highly compressed yet accurate models, illustrating its dual versatility in both understanding and optimizing DNNs.
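The grouping of functionally similar layers can be illustrated with linear CKA, a standard representation-similarity measure. The sketch below is an illustration only: the greedy merging of adjacent layers and the `threshold` value are assumptions made for this example, and the paper's own grouping criterion (its Layer Grouping theorem) is not reproduced here. Each activation matrix is assumed to be flattened to shape `(n_samples, n_features)`.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices with the same number of rows."""
    # Centre each feature column before comparing representations.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def group_layers(activations, threshold=0.9):
    """Greedily merge consecutive layers whose CKA similarity meets the threshold.

    `threshold` is a hypothetical hyperparameter chosen for illustration.
    Returns a list of groups, each a list of layer indices.
    """
    groups = [[0]]
    for idx in range(1, len(activations)):
        if linear_cka(activations[idx - 1], activations[idx]) >= threshold:
            groups[-1].append(idx)  # similar to the previous layer: same node
        else:
            groups.append([idx])  # dissimilar: start a new candidate node
    return groups
```

Because linear CKA is invariant to isotropic scaling, a layer that only rescales its input is merged with the preceding layer in this sketch, while a layer producing an unrelated representation starts a new group.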

Paper Structure (26 sections, 3 theorems, 8 equations, 11 figures, 1 table)

This paper contains 26 sections, 3 theorems, 8 equations, 11 figures, 1 table.

Introduction
Related Work
Theoretical Foundations and Methodology
Causal Theory
Causal Discovery
Interventions
Causal Abstraction
Estimation of Causal Effects
Counterfactual Generation
Experiments
Causal Discovery and Feature Attributions
Counterfactual Analysis
Generalization and Scalability
Beyond Local Explainability
Discussions and Limitations
...and 11 more sections

Key Result

Proposition 1

Let $F: \mathcal{X} \rightarrow \mathcal{Y}$ denote the mapping function of a DNN. For any $x \in \mathcal{X}$, $I \subseteq \{1, \ldots, d\}$, and $b \in \mathbb{R}$, the intervened sample $x'$ isolates the causal effect of the features in $I$ on $F$ by setting the values of $x_i, \forall i \in I$
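A small sketch of the kind of input intervention the proposition describes is given below. The statement above is truncated, so the code assumes that the features indexed by $I$ are set to the baseline value $b$ while all other features are left unchanged; that reading is an assumption. The helper names `intervene` and `output_shift` are hypothetical, and `model` stands for any callable mapping an input vector to an output.

```python
import numpy as np


def intervene(x, indices, baseline):
    """Return a copy of x with the features in `indices` set to `baseline`.

    Assumed reading of the proposition: x'_i = b for i in I, x'_j = x_j otherwise.
    """
    x_prime = np.array(x, dtype=float, copy=True)
    x_prime[list(indices)] = baseline
    return x_prime


def output_shift(model, x, indices, baseline):
    """Change in the model output caused by the intervention on `indices`."""
    x = np.asarray(x, dtype=float)
    return model(intervene(x, indices, baseline)) - model(x)
```

Since only the selected features differ between `x` and the intervened copy, any change in the output can be attributed to those features for that sample.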

Figures (11)

Figure 1: Overview of Tracer. Interventions and counterfactuals are used to determine the effects of individual features on the models' intermediate and final outputs, leading to the discovery of the mechanisms underpinning the decision-making process.
Figure 2: Counterfactual GAN architecture.
Figure 3: Tracer's causal analysis results for an MNIST sample classified by AlexNet. The causal structure is inferred using CKA similarities between activation outputs from various layers. Nodes in the resulting causal graph symbolize layer groups, while the connections between them capture their causal relationships.
Figure 4: Reliability scores of different explainability methods on the MNIST dataset.
Figure 5: Tracer vs existing XAI methods using an ImageNet sample classified by ResNet-50. The second row shows feature contributions from different causal nodes, while the bottom row compares the explanations provided by different methods. The sparse explanations given by SHAP and LRP may require high-resolution screens for adequate visualization.
...and 6 more figures

Theorems & Definitions (13)

Definition 1: Structural Causal Model
Definition 2: Explanation
Proposition 1: Causal Isolation of Intervened Samples
Definition 3: Layer Groups
Theorem 1: Layer Grouping
Theorem 2: Necessary and Sufficient Conditions for Causal Nodes
Definition 4: Causal Links between Layer Groups
Definition 5: Average Causal Effect
Remark
Remark
...and 3 more
