B-XAIC Dataset: Benchmarking Explainable AI for Graph Neural Networks Using Chemical Data

Magdalena Proszewska, Tomasz Danel, Dawid Rymarczyk

TL;DR

B-XAIC introduces a large real-world benchmark for explainable AI in chemistry, pairing 50K molecular graphs with ground-truth atom- and bond-level rationales across seven tasks. It enables direct evaluation of both post-hoc and intrinsically interpretable GNNs by separating null and subgraph explanations and evaluating node- and edge-level fidelity. Across experiments, high predictive accuracy (e.g., for GIN) coexists with inconsistent and sometimes misleading explanations, underscoring limitations of current XAI techniques for molecular graphs. The benchmark provides a rigorous, shareable standard to drive development of faithful, robust explainability methods for graph-based drug discovery and material design.

Abstract

Understanding the reasoning behind deep learning model predictions is crucial in cheminformatics and drug discovery, where a molecule's structure determines its properties. However, current evaluation frameworks for Explainable AI (XAI) in this domain often rely on artificial datasets or simplified tasks, employing data-derived metrics that fail to capture the complexity of real-world scenarios and lack a direct link to explanation faithfulness. To address this, we introduce B-XAIC, a novel benchmark constructed from real-world molecular data and diverse tasks with known ground-truth rationales for assigned labels. Through a comprehensive evaluation using B-XAIC, we reveal limitations of existing XAI methods for Graph Neural Networks (GNNs) in the molecular domain. This benchmark provides a valuable resource for gaining deeper insights into the faithfulness of XAI, facilitating the development of more reliable and interpretable models.
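
As a rough illustration of what comparing an explanation with a ground-truth rationale can look like, the sketch below scores per-atom importance values against binary atom labels using ROC AUC. The choice of metric, the function name `explanation_auroc`, and the toy numbers are assumptions made for illustration only; they are not taken from the paper and are not meant to reproduce the benchmark's own evaluation protocol.

```python
def explanation_auroc(scores, labels):
    """ROC AUC of atom importance scores against binary ground-truth labels.

    scores: one importance value per atom (higher = more relevant).
    labels: 1 if the atom belongs to the ground-truth rationale, else 0.
    Hypothetical helper; the metric choice is an assumption, not the paper's.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        # AUC is undefined when every atom has the same label,
        # e.g. when no atom is part of any rationale.
        raise ValueError("need at least one relevant and one irrelevant atom")

    # Count how often a relevant atom outranks an irrelevant one;
    # ties count as half a win (Mann-Whitney formulation of AUC).
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy molecule with four atoms; the first two form the rationale.
toy_scores = [0.9, 0.8, 0.1, 0.2]
toy_labels = [1, 1, 0, 0]
toy_auc = explanation_auroc(toy_scores, toy_labels)
```

The same function could be applied to bond importance scores and bond labels for an edge-level comparison. Molecules without any labeled atoms need a separate treatment, since a ranking metric is undefined for them.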

Paper Structure

This paper contains 26 sections, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Schematic of our B-XAIC dataset and benchmark: (a) the dataset preparation steps include compound labeling, filtering, and sampling into training, validation, and test subsets; (b) for each positive example, atom and bond labels are provided to assess model explanations; (c) the patterns for the halogen and indole tasks are presented, along with four example PAINS patterns.
  • Figure 2: Histogram of Tanimoto similarities illustrating the diversity of the dataset.
  • Figure 3: Evaluation of node-level explanations for GIN. Null explanation results are shown in green, and subgraph explanation results in orange. Overall average scores for each method are displayed in the center.
  • Figure 4: Evaluation of edge-level explanations for GIN. Null explanation results are shown in green, and subgraph explanation results in orange. Overall average scores for each method are displayed in the center.
  • Figure 5: Boxplots showing the distribution of explanation quality across different explainers for each model. Results are aggregated per model, highlighting that some models are inherently more difficult to explain than others.
  • ...and 11 more figures
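
Figure 2 refers to Tanimoto similarity as a measure of dataset diversity. As a minimal sketch of that quantity, the code below computes the Tanimoto (Jaccard) coefficient between fingerprints represented as sets of on-bit indices, then collects all pairwise values, which is the kind of data a similarity histogram would be drawn from. The fingerprint representation, the function names, the toy fingerprints, and the convention for two empty fingerprints are assumptions for illustration; the listing does not say how the authors computed their fingerprints.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient |A & B| / |A | B| for two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        # Assumed convention: two empty fingerprints are treated as identical.
        return 1.0
    return len(a & b) / union


def pairwise_similarities(fingerprints):
    """Tanimoto similarity for every unordered pair of fingerprints."""
    return [tanimoto(a, b) for a, b in combinations(fingerprints, 2)]


# Toy fingerprints (hypothetical bit indices, not real molecules).
toy_fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
toy_sims = pairwise_similarities(toy_fps)
```

A distribution concentrated at low values would indicate that most pairs of molecules share few fingerprint bits, which is what a diverse dataset would be expected to show.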