Explaining Black-box Model Predictions via Two-level Nested Feature Attributions with Consistency Property

Yuya Yoshikawa, Masanari Kimura, Ryotaro Shimizu, Yuki Saito

TL;DR

This work addresses the problem of explaining black-box predictions for nested inputs by proposing Consistent Two-level Feature Attribution (C2FA), a model-agnostic local explanation method that jointly estimates high-level and low-level attributions under the consistency constraint $\alpha_j = \sum_{d=1}^{D_j} \beta_{jd}$, where $\alpha_j$ is the attribution of the $j$-th high-level feature and $\beta_{jd}$ is the attribution of its $d$-th low-level feature. The method fits surrogate models $e^{\mathrm{H}}$ and $e^{\mathrm{L}}$ to perturbed inputs and solves the resulting optimization problem with ADMM to enforce cross-level consistency, while allowing regularization and non-negativity constraints. Theoretical results establish existence and uniqueness of the solution and convergence of the method, along with high-probability guarantees. Empirically, C2FA produces accurate, faithful, and consistent HiFAs and LoFAs on MIL image classification and on text classification with language models, and requires fewer model queries than prior approaches. The method offers practical benefits for interpretable AI in nested data domains and may extend to more deeply nested structures.
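To make the consistency constraint concrete, here is a minimal sketch rather than the authors' algorithm. It assumes a hypothetical toy black box that is additive in its low-level features, linear surrogates without intercepts, and binary perturbation masks. Instead of ADMM, it enforces $\alpha_j = \sum_{d} \beta_{jd}$ by substituting $\boldsymbol{\alpha} = S\boldsymbol{\beta}$ and solving one stacked least-squares problem. The names `group_sizes`, `S`, `true_w`, and `black_box` are illustrative and do not come from the paper.

```python
import numpy as np

# Hypothetical nested input: 3 high-level features with 2, 3, and 1 low-level features.
group_sizes = [2, 3, 1]
J, D = len(group_sizes), sum(group_sizes)

# S[j, k] = 1 if low-level feature k belongs to high-level feature j,
# so the consistency constraint alpha_j = sum_d beta_jd reads alpha = S @ beta.
S = np.zeros((J, D))
start = 0
for j, size in enumerate(group_sizes):
    S[j, start:start + size] = 1.0
    start += size

# Stand-in "black box" (an assumption): additive in the low-level features kept by the mask.
true_w = np.array([0.5, -0.2, 1.0, 0.3, 0.0, -0.7])

def black_box(low_mask):
    return low_mask @ true_w

rng = np.random.default_rng(0)
n_high, n_low = 20, 50  # number of high-level and low-level perturbations
Z_high = rng.integers(0, 2, size=(n_high, J)).astype(float)
Z_low = rng.integers(0, 2, size=(n_low, D)).astype(float)

# Masking a high-level feature masks all of its low-level features.
y_high = black_box(Z_high @ S)
y_low = black_box(Z_low)

# Substitute alpha = S @ beta into the high-level surrogate, so that
# Z_high @ alpha = (Z_high @ S) @ beta, and fit both surrogates jointly.
A = np.vstack([Z_high @ S, Z_low])
b = np.concatenate([y_high, y_low])
beta, *_ = np.linalg.lstsq(A, b, rcond=None)

# High-level attributions follow from the constraint, so they are consistent by construction.
alpha = S @ beta
```

Because both sets of perturbations constrain the same `beta`, queries at one level also inform the attributions at the other level. This is the intuition behind needing fewer queries, although the sketch does not reproduce the paper's regularization, non-negativity constraints, or ADMM solver.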

Abstract

Techniques that explain the predictions of black-box machine learning models are crucial to make the models transparent, thereby increasing trust in AI systems. The input features to the models often have a nested structure that consists of high- and low-level features, and each high-level feature is decomposed into multiple low-level features. For such inputs, both high-level feature attributions (HiFAs) and low-level feature attributions (LoFAs) are important for better understanding the model's decision. In this paper, we propose a model-agnostic local explanation method that effectively exploits the nested structure of the input to estimate the two-level feature attributions simultaneously. A key idea of the proposed method is to introduce the consistency property that should exist between the HiFAs and LoFAs, thereby bridging the separate optimization problems for estimating them. Thanks to this consistency property, the proposed method can produce HiFAs and LoFAs that are both faithful to the black-box models and consistent with each other, using a smaller number of queries to the models. In experiments on image classification in multiple instance learning and text classification using language models, we demonstrate that the HiFAs and LoFAs estimated by the proposed method are accurate, faithful to the behaviors of the black-box models, and provide consistent explanations.

Paper Structure

This paper contains 33 sections, 5 theorems, 35 equations, 12 figures, 1 algorithm.

Key Result

Lemma 4.6

Under the paper's assumptions, the C2FA optimization problem has at least one global minimizer $(\boldsymbol{\alpha}^*, \boldsymbol{\beta}^*)$.

Figures (12)

  • Figure 1: Example of the black-box model prediction for a nested structured input and its corresponding high- and low-level feature attributions estimated by the proposed method with consistency constraints. Objects in each high-level feature represent the low-level features.
  • Figure 2: Quantitative evaluation on the image classification task. (a) NDCG (higher is better) and deletion scores (lower is better) of the estimated HiFAs. (b) AUROC (higher is better) and deletion scores (lower is better) of the estimated LoFAs. (c) Consistency scores (lower is better) and the agreement scores of MIHL (higher is better). The error bars represent the standard deviations of the scores over three runs with different random seeds.
  • Figure 3: Example of the estimated HiFAs and LoFAs on the image classification task when $N_{\rm{H}}=20$ and $N_{\rm{L}}=50$. The input is shown on the first row, where the image with the red border is the positive instance. The LoFAs of super-pixels estimated by the proposed method and LIME are shown in the second and third rows, respectively. In each case, super-pixels are highlighted in green, with the intensity of the green color indicating the magnitude of the LoFA.
  • Figure 4: Quantitative evaluation on the text classification task. (a) Deletion scores of the HiFAs and LoFAs (lower is better). (b) Consistency scores (lower is better) and agreement scores of MIHL (higher is better).
  • Figure 5: Example of the estimated HiFAs and LoFAs for a negative review text when $N_{\rm{H}}=50$ and $N_{\rm{L}}=50$. The review text is shown at the top, and the HiFAs (left) and the top-5 LoFAs (right) estimated by each method are shown at the bottom. Here, the words highlighted with a pink background in the review correspond to those top-5 LoFAs in the chart.
  • ...and 7 more figures

Theorems & Definitions (10)

  • Lemma 4.6: Existence of a Global Minimizer
  • Lemma 4.7: Strong Convexity and Uniqueness
  • Theorem 4.8
  • Theorem 4.9: High-Probability Convergence of C2FA
  • Corollary 4.10: Uniform Approximation Guarantee
  • Proof of Lemma 4.6
  • Proof of Lemma 4.7
  • Proof of Theorem 4.8
  • Proof of Theorem 4.9
  • Proof of Corollary 4.10