DiConStruct: Causal Concept-based Explanations through Black-Box Distillation

Ricardo Moreira, Jacopo Bono, Mário Cardoso, Pedro Saleiro, Mário A. T. Figueiredo, Pedro Bizarro

TL;DR

DiConStruct presents a surrogate explainer that distills a black-box predictor into a structural causal model over human-interpretable concepts, enabling causal graphs and concept attributions without sacrificing main task performance. The framework decouples exogenous factors through an independence objective and learns a concept distillation SCM guided by a predefined causal DAG $\boldsymbol{\text{G}}_{C,Y_B}$, leveraging global and local edge mechanisms and counterfactual reasoning. Training optimizes a trio of losses, including an independence loss $\mathcal{L}_E$, to achieve high fidelity to the black-box while preserving concept predictiveness ($\mathcal{L}_C$) and distillation accuracy ($\mathcal{L}_D$). Empirical results on image and tabular data show that DiConStruct attains high fidelity and competitive concept accuracy, with local configurations delivering notable improvements in explanation quality and enabling interpretable counterfactual analyses of concept interactions and their effects on predictions.
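
One way to read the "trio of losses" is as a weighted sum of the three terms. This summary does not state how the terms are combined, so the additive form and the weights $\lambda_C$, $\lambda_D$, $\lambda_E$ below are assumptions made purely for illustration:

```latex
\mathcal{L} \;=\; \lambda_C \, \mathcal{L}_C \;+\; \lambda_D \, \mathcal{L}_D \;+\; \lambda_E \, \mathcal{L}_E
```

Under this assumed form, a larger $\lambda_D$ would push the explainer toward fidelity to the black-box score, a larger $\lambda_C$ toward concept accuracy, and a larger $\lambda_E$ toward independent exogenous factors.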

Abstract

Model interpretability plays a central role in human-AI decision-making systems. Ideally, explanations should be expressed using human-interpretable semantic concepts. Moreover, the causal relations between these concepts should be captured by the explainer to allow for reasoning about the explanations. Lastly, explanation methods should be efficient and not compromise the performance of the predictive task. Despite the rapid advances in AI explainability in recent years, to the best of our knowledge, no method fulfills all three of these properties. Indeed, mainstream methods for local concept explainability do not produce causal explanations and incur a trade-off between explainability and prediction performance. We present DiConStruct, an explanation method that is both concept-based and causal, with the goal of creating more interpretable local explanations in the form of structural causal models and concept attributions. Our explainer works as a distillation model for any black-box machine learning model by approximating its predictions while producing the respective explanations. Because of this, DiConStruct generates explanations efficiently while not impacting the black-box prediction task. We validate our method on an image dataset and a tabular dataset, showing that DiConStruct approximates the black-box models with higher fidelity than other concept explainability baselines, while providing explanations that include the causal relations between the concepts.

Paper Structure

This paper contains 35 sections, 9 equations, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: DiConStruct is a surrogate explainer for obtaining explanations in the form of (1) causal graphs relating concepts to black-box scores and (2) concept attributions. (a) DiConStruct overview. (b) Toy example of a learned causal graph explanation for predicting cardiac arrest. For example, a low likelihood of smoking (white node) has a negative impact on cholesterol (red edge) but a positive impact on weight (blue edge).
  • Figure 2: DiConStruct architecture. On the left side are the inputs to our explainer: (1) an instance to explain $\boldsymbol{x}$; (2) a DAG relating the concepts (Section \ref{sec:obtaining-a-causal-graph} discusses approaches to obtain this DAG); (3) a black-box classifier.
  • Figure 3: Distribution of fidelity and concept accuracy for the CUB-200-2011 (left) and Merchant Fraud (right) black-boxes. Each plot shows the result of 16 experiments of 50 trials, and each marker is a trained DiConStruct explainer. Models on the Pareto efficiency frontier are highlighted with bigger markers.
  • Figure 4: (a) Learned SCM for an instance of the Merchant Fraud dataset. Blue edge color denotes positive interactions, red denotes negative interactions, and the intensity represents the interaction strength. A positive/negative interaction increases/decreases the value of the destination node, respectively. Concept likelihood (nodes) is encoded from white (low) to black (high). (b) Concept attribution plot for the same instance.
  • Figure 5: Concept attribution densities for two DiConStruct explainers trained with the Trivial DAG and evaluated on the Merchant Fraud test set.
  • ...and 1 more figure
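
The caption of Figure 3 refers to explainers on the Pareto efficiency frontier of fidelity versus concept accuracy. As a minimal sketch of what that selection means, the snippet below keeps every (fidelity, concept accuracy) pair that no other pair matches or beats on both axes while strictly beating it on at least one. The function name `pareto_frontier` and the sample values are hypothetical and chosen for illustration only; they are not results from the paper, and the authors' own selection procedure may differ.

```python
def pareto_frontier(points):
    """Return the points that are not dominated by any other point.

    Each point is a (fidelity, concept_accuracy) tuple; higher is better on
    both axes. A point is dominated if another point is at least as good on
    both axes and strictly better on at least one.
    """
    frontier = []
    for i, (fid_i, acc_i) in enumerate(points):
        dominated = False
        for j, (fid_j, acc_j) in enumerate(points):
            if j == i:
                continue
            at_least_as_good = fid_j >= fid_i and acc_j >= acc_i
            strictly_better = fid_j > fid_i or acc_j > acc_i
            if at_least_as_good and strictly_better:
                dominated = True
                break
        if not dominated:
            frontier.append((fid_i, acc_i))
    return frontier


# Illustrative (made-up) scores for five trained explainers.
explainers = [
    (0.90, 0.70),
    (0.85, 0.80),
    (0.80, 0.75),  # dominated by (0.85, 0.80)
    (0.95, 0.60),
    (0.90, 0.65),  # dominated by (0.90, 0.70)
]
frontier = pareto_frontier(explainers)
print(frontier)
```

The quadratic pairwise comparison is adequate for a few hundred trained explainers; a sort-based sweep would be preferable for much larger sets.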