Gnothi Seauton: Empowering Faithful Self-Interpretability in Black-Box Transformers

Shaobo Wang, Hongxuan Tang, Mingyang Wang, Hongrui Zhang, Xuyang Liu, Weiya Li, Xuming Hu, Linfeng Zhang

TL;DR

The paper addresses the XAI gap between self-interpretable models and post-hoc explanations by proposing AutoGnothi, a parameter-efficient pipeline that side-tunes lightweight surrogates and explainers to enable faithful Shapley-value explanations without altering the original backbone. It provides theoretical guarantees for surrogate and explainer training and demonstrates substantial reductions in training memory, parameter counts, and inference costs while maintaining or improving explanation quality on vision (ViT) and language (BERT) tasks. AutoGnothi enables single-pass inference for predictions and explanations, offering practical benefits for high-stakes decisions. Overall, the work advances practical self-interpretability for black-box transformers through Ladder Side-Tuning and SURROGATE+EXPLAINER joint optimization that preserves accuracy and fidelity of explanations across domains.

Abstract

The debate between self-interpretable models and post-hoc explanations for black-box models is central to Explainable AI (XAI). Self-interpretable models, such as concept-based networks, offer insights by connecting decisions to human-understandable concepts but often struggle with performance and scalability. Conversely, post-hoc methods like Shapley values, while theoretically robust, are computationally expensive and resource-intensive. To bridge the gap between these two lines of research, we propose a novel method that combines their strengths, providing theoretically guaranteed self-interpretability for black-box models without compromising prediction accuracy. Specifically, we introduce a parameter-efficient pipeline, AutoGnothi, which integrates a small side network into the black-box model, allowing it to generate Shapley value explanations without changing the original network parameters. This side-tuning approach significantly reduces memory, training, and inference costs, outperforming traditional parameter-efficient methods, where full fine-tuning serves as the optimal baseline. AutoGnothi enables the black-box model to predict and explain its predictions with minimal overhead. Extensive experiments show that AutoGnothi offers accurate explanations for both vision and language tasks, delivering superior computational efficiency with comparable interpretability.
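To make the cost argument concrete, here is a minimal sketch of exact Shapley value computation for a small cooperative game. It is not the paper's method; it only illustrates why exact Shapley values scale exponentially with the number of features (every subset of the other players is enumerated). The names `exact_shapley` and `toy_value`, and the toy value function itself, are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value_fn, n):
    """Exact Shapley values for an n-player game.

    value_fn maps a frozenset of player indices to a float.
    Cost grows as O(n * 2^(n-1)) calls to value_fn.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            # Shapley weight for coalitions of size k that exclude player i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of player i to coalition s
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_value(s):
    """Hypothetical value function: additive terms plus one interaction."""
    base = {0: 1.0, 1: 2.0, 2: 3.0}
    total = sum(base[j] for j in s)
    if 0 in s and 1 in s:
        total += 1.0  # interaction shared between players 0 and 1
    return total


phi = exact_shapley(toy_value, 3)
print(phi)  # approximately [1.5, 2.5, 3.0]
```

The attributions sum to `toy_value` of the full set minus `toy_value` of the empty set (the efficiency property). With image patches or tokens as players, the number of subsets makes this enumeration impractical, which is the motivation for learned explainers.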

Paper Structure

This paper contains 27 sections, 6 theorems, 42 equations, 13 figures, 10 tables.

Key Result

Theorem 1

Let the surrogate model be trained using gradient descent with step size $\alpha$ for $t$ iterations. The expected KL divergence between the original model’s predictions $f(x)$ and the surrogate model’s predictions $g_{\beta}(x_s)$ is upper-bounded by: where $\beta_0$ is the initial parameter value, $\mathcal{L}_{\text{surr}}^\star$ is the optimal value during optimization, and $\mu$ is the minim
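The quantity that Theorem 1 bounds, the expected KL divergence between $f(x)$ and $g_{\beta}(x_s)$, can be estimated by Monte Carlo sampling over masks. The sketch below illustrates only that quantity, not the bound itself. The linear stand-in models, the zero-fill masking, the Bernoulli(0.5) mask distribution, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def linear_classifier(weights, x):
    """Hypothetical stand-in model: softmax over a linear map of x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return softmax(logits)


def expected_surrogate_kl(f_weights, g_weights, x, num_masks=200, seed=0):
    """Monte Carlo estimate of E_s[ KL( f(x) || g(x_s) ) ].

    Assumption: each feature is kept with probability 0.5 and
    masked features are replaced by 0.0.
    """
    rng = random.Random(seed)
    target = linear_classifier(f_weights, x)  # f(x) on the full input
    total = 0.0
    for _ in range(num_masks):
        keep = [rng.random() < 0.5 for _ in x]
        x_s = [xi if k else 0.0 for xi, k in zip(x, keep)]
        total += kl_divergence(target, linear_classifier(g_weights, x_s))
    return total / num_masks


f_w = [[1.0, -0.5, 0.2], [-0.3, 0.8, 0.1]]   # toy original model
g_w = [[0.9, -0.4, 0.3], [-0.2, 0.7, 0.0]]   # toy surrogate
x = [1.0, 2.0, -1.0]
loss = expected_surrogate_kl(f_w, g_w, x)
print(loss)
```

A surrogate is trained to drive this average toward its minimum, so that its predictions on masked inputs stay close to the original model's predictions on the full input.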

Figures (13)

  • Figure 1: Different paradigms towards XAI. (a) The ideal paradigm for XAI envisions using white-box models for prediction, which are inherently self-interpretable by design but hard to achieve. (b) The previous paradigm involves post-hoc explanations of black-box models by training a separate, heavy-weight explainer. (c) We propose a novel parameter-efficient paradigm, AutoGnothi, which fine-tunes the black-box model to make it self-interpretable.
  • Figure 2: Explanation quality on the ImageNette dataset using different ViTs. Our AutoGnothi significantly reduces the number of trainable parameters, computational costs (FLOPs), and training GPU memory storage without compromising explanation quality.
  • Figure 3: Overview of AutoGnothi compared to prior work. (a) ViT-Shapley fully fine-tunes the black-box model to create a surrogate model, then trains a separate explainer based on the surrogate, which is resource-intensive. (b) We employ side-tuning to efficiently obtain both the black-box model and explainer, significantly reducing training costs. AutoGnothi uses a single model to simultaneously generate predictions and explanations, lowering inference costs by leveraging shared features. In contrast, ViT-Shapley needs to load two models, one for prediction and one for explanation, and runs inference twice. AutoGnothi enables self-interpretability for an arbitrary black-box model. We ignore the positional encoding associated with the pipeline.
  • Figure 4: Training performance of surrogate and explainer models. (a) Prediction accuracy on masked inputs for the original classifier, the surrogate model trained with ViT-Shapley, and our AutoGnothi. AutoGnothi shows greater robustness as the number of masked patches increases. For each mask size, we randomly sampled 100 images and generated 10 random masks. The curve represents the average prediction probability. (b) Explanation quality, measured by insertion and deletion metrics, for various explanation methods. We randomly sampled 1,000 images and averaged the prediction probabilities to assess insertion and deletion performance. All experiments were conducted on the ImageNette dataset using the ViT-base model.
  • Figure 5: Other pipelines that achieve self-interpretability through full fine-tuning. (a) Freeze the transformer encoder and prediction head, learning only the explanation head. (b) Simultaneously learn the transformer encoder and both task heads. (c) Comparison of classification and explanation performance between the different pipelines for ViT-base.
  • ...and 8 more figures

Theorems & Definitions (10)

  • Theorem 1: Proof in the appendix
  • Theorem 2: Proof in the appendix
  • Lemma 1
  • proof
  • Theorem 1
  • proof
  • Lemma 2
  • proof
  • Theorem 2
  • proof