Enhancing Model Interpretability with Local Attribution over Global Exploration

Zhiyu Zhu; Zhibo Jin; Jiayu Zhang; Huaming Chen

TL;DR

This work addresses the fragility of attribution methods caused by intermediate states that fall outside the model's in-distribution region. It introduces Local Space and the Local Attribution (LA) algorithm, which combines targeted and untargeted adversarial exploration within a locality defined by $B_{\boldsymbol{\epsilon}}(x)$ to preserve meaningful decision boundaries. The approach is grounded in formal problem definitions, a priori axioms, and proofs, and demonstrates substantial gains over 11 baselines across four networks on ImageNet, with an average Insertion improvement of $0.31758$ and an average Deletion reduction of $0.028883$. Extensive ablations validate the importance of linear space constraints, mixed attack strategies, sampling counts, and spatial range, underscoring the practical impact of maintaining locality in attribution. The work advances XAI with a provable, scalable method that yields more faithful explanations, and its code is publicly released.
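Insertion and Deletion are commonly computed by ranking features by attribution and then either deleting them from the original input or inserting them into a baseline input, tracking the model's class score as features change; the area under each curve is the metric (higher is better for Insertion, lower for Deletion). The following is a minimal, framework-free sketch of that common formulation, not the paper's evaluation code. It assumes a flat feature vector, a constant `baseline` value (0.0) for removed features, and a hypothetical `score_fn` that returns the explained class's score; the paper's exact baseline, step size, and normalization are not stated here.

```python
from typing import Callable, List, Sequence, Tuple


def _trapezoid_auc(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Area under the piecewise-linear curve through the points (xs[i], ys[i])."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2 for i in range(len(xs) - 1)
    )


def insertion_deletion(
    x: List[float],
    attribution: List[float],
    score_fn: Callable[[List[float]], float],
    baseline: float = 0.0,
    step: int = 1,
) -> Tuple[float, float]:
    """Return (insertion_auc, deletion_auc) for a single input.

    Features are ranked by attribution, largest first. Deletion overwrites them
    with `baseline`, `step` features at a time; insertion starts from an
    all-baseline input and restores the original values in the same order.
    The x-axis is the fraction of features changed so far, in [0, 1].
    """
    n = len(x)
    order = sorted(range(n), key=lambda i: attribution[i], reverse=True)
    deleted = list(x)
    inserted = [baseline] * n
    fractions = [0.0]
    del_scores = [score_fn(deleted)]
    ins_scores = [score_fn(inserted)]
    for start in range(0, n, step):
        for i in order[start:start + step]:
            deleted[i] = baseline
            inserted[i] = x[i]
        fractions.append(min(start + step, n) / n)
        del_scores.append(score_fn(deleted))
        ins_scores.append(score_fn(inserted))
    return _trapezoid_auc(fractions, ins_scores), _trapezoid_auc(fractions, del_scores)
```

With a toy linear scorer, an attribution that matches the true weights scores higher on Insertion and lower on Deletion than the reversed ranking, which is a quick sanity check for any implementation of the metric.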

Abstract

In the field of artificial intelligence, AI models are frequently described as "black boxes" due to the obscurity of their internal mechanisms. This has ignited research interest in model interpretability, especially in attribution methods that offer precise explanations of model decisions. Current attribution algorithms typically evaluate the importance of each parameter by exploring the sample space. A large number of intermediate states are introduced during the exploration process, and these may reach the model's Out-of-Distribution (OOD) space. Such intermediate states affect the attribution results, making it challenging to grasp the relative importance of features. In this paper, we first define the local space and its relevant properties, and then propose the Local Attribution (LA) algorithm, which leverages these properties. The LA algorithm comprises both targeted and untargeted exploration phases, which are designed to generate intermediate states for attribution that thoroughly cover the local space. Compared to state-of-the-art attribution methods, our approach achieves an average improvement of 38.21% in attribution effectiveness. Extensive ablation studies also validate the significance of each component of our algorithm. Our code is available at: https://github.com/LMBTough/LA/
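The abstract describes LA only at a high level. As a rough illustration of the idea, the sketch below runs mixed untargeted and targeted exploration on a hypothetical toy model (a 3-class linear softmax classifier in plain Python) and keeps every intermediate state inside a local box around $x$. Each path takes signed-gradient steps (untargeted: lower the original class's log-probability; targeted: raise another class's), every step is clipped to the box, and the gradient of the original class's log-probability is integrated along the path. This is a simplified sketch in the style of adversarial-gradient-integration attribution, not the authors' LA algorithm (see the repository linked above for that); the step size `alpha`, the step count, the radius $|x_i|/s$, and all function names are assumptions.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(z: Vector) -> Vector:
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def grad_log_prob(W: Matrix, x: Vector, c: int) -> Vector:
    """Gradient of log softmax(W x)[c] w.r.t. x, i.e. W[c] - sum_k p_k W[k]."""
    p = softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])
    return [W[c][j] - sum(p[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def explore_and_attribute(W: Matrix, x: Vector, y: int, s: float = 8.0,
                          alpha: float = 0.05, steps: int = 10) -> Vector:
    """Hypothetical mixed untargeted/targeted exploration inside a local box.

    Every path starts at x, takes signed-gradient steps, and is clipped to
    [x_i - eps_i, x_i + eps_i] with eps_i = |x_i| / s (assumed radius).
    Attribution is a Riemann sum of grad(log p_y) times each step's
    displacement back toward x, averaged over all paths.
    """
    eps = [abs(xi) / s for xi in x]
    lo = [xi - e for xi, e in zip(x, eps)]
    hi = [xi + e for xi, e in zip(x, eps)]
    # One untargeted path (lower log p_y) plus one targeted path per other class
    paths = [(-1.0, y)] + [(1.0, t) for t in range(len(W)) if t != y]
    attribution = [0.0] * len(x)
    for direction, c in paths:
        cur = list(x)
        for _ in range(steps):
            g = grad_log_prob(W, cur, c)
            # direction -1 descends log p_y (untargeted); +1 ascends log p_t (targeted)
            nxt = [min(max(v + direction * alpha * sign(gi), l), h)
                   for v, gi, l, h in zip(cur, g, lo, hi)]
            gy = grad_log_prob(W, cur, y)
            for j in range(len(x)):
                attribution[j] += gy[j] * (cur[j] - nxt[j])
            cur = nxt
    return [a / len(paths) for a in attribution]
```

Summing the returned vector approximates the average drop in $\log p_y$ from $x$ to each path's endpoint; because the endpoints are clipped to the box, no intermediate state leaves the local neighbourhood of $x$.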

Paper Structure

This paper contains 30 sections, 2 theorems, 5 equations, 23 figures, 4 tables, and 1 algorithm.

Introduction
Related Work
  Local Approximation Methods
  Gradient-based Attribution Methods
  Adversarial-sample-based Attribution Methods
Method
  Problem Definition
  Local Space of Attribution
  Deep Analysis of Untargeted and Targeted Adversarial Attacks
  Local Space Sampling Optimization
Experiments
  Dataset and Models
  Baselines
  Evaluated Metrics
  Parameters

Key Result

theorem 1

Given a sample $x \in \mathbb{R}^n$, the $\epsilon\text{-Local Space}$ of $x$, denoted as $B_{\epsilon}(x)$, is defined as:

$$B_{\epsilon}(x) = \{\, x' \in \mathbb{R}^n : |x'_i - x_i| \le \epsilon_i,\ i = 1, \dots, n \,\},$$

where $\epsilon \in \mathbb{R}^n$ and $\epsilon_i = \frac{x_i}{s}$, with $s$ being a hyperparameter that controls the size of the local space (Spatial Range).
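A minimal sketch of this definition, reading $B_{\epsilon}(x)$ as the per-coordinate box $\{x' : |x'_i - x_i| \le \epsilon_i\}$ and assuming inputs are flat lists of floats. It computes the box bounds, tests membership, projects (clips) an arbitrary point back into the box, and draws a uniform sample from it. The radius is taken as $|x_i|/s$ so that it stays non-negative for negative inputs; for non-negative inputs such as pixel intensities this equals $x_i/s$ as stated. All function names are illustrative only.

```python
import random
from typing import List, Tuple

Vector = List[float]


def local_space_bounds(x: Vector, s: float) -> Tuple[Vector, Vector]:
    """Per-coordinate bounds of B_eps(x), with eps_i = |x_i| / s (assumed)."""
    eps = [abs(xi) / s for xi in x]
    return [xi - e for xi, e in zip(x, eps)], [xi + e for xi, e in zip(x, eps)]


def in_local_space(x_prime: Vector, x: Vector, s: float) -> bool:
    """True if every coordinate satisfies |x'_i - x_i| <= eps_i."""
    lo, hi = local_space_bounds(x, s)
    return all(l <= v <= h for v, l, h in zip(x_prime, lo, hi))


def project_to_local_space(x_prime: Vector, x: Vector, s: float) -> Vector:
    """Clip each coordinate of x_prime into [x_i - eps_i, x_i + eps_i]."""
    lo, hi = local_space_bounds(x, s)
    return [min(max(v, l), h) for v, l, h in zip(x_prime, lo, hi)]


def sample_local_space(x: Vector, s: float, rng: random.Random) -> Vector:
    """Draw one point uniformly from the box B_eps(x)."""
    lo, hi = local_space_bounds(x, s)
    return [rng.uniform(l, h) for l, h in zip(lo, hi)]


x = [1.0, 0.5, 0.0]
print(project_to_local_space([2.0, 0.0, 0.3], x, s=4.0))  # [1.25, 0.375, 0.0]
```

Because $\epsilon_i$ scales with $x_i$, a larger `s` shrinks the box, and under this reading a zero-valued coordinate has zero radius and stays fixed.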

Figures (23)

Figure 1: A vast amount of Out-of-Distribution (OOD) space exists outside the defined Local Space, and samples within the OOD space provide no guidance for attribution. Using both untargeted and targeted attacks allows the Local Space to be explored more comprehensively. This is analyzed in depth, from the perspective of the loss function, in the section "Deep Analysis of Untargeted and Targeted Adversarial Attacks".
Figure 2: After more than two attack iterations, the sample essentially no longer carries the characteristics of its current category as seen by the model, so subsequent samples lie in the OOD space and no longer meaningfully guide the attribution algorithm.
Figure 3: After the important features are removed, differences in importance among the remaining features become far less pronounced. As shown in (a), the features in the red area are notably more important for the cat category than those in the blue area. However, as depicted in (b), once the cat features have been removed, it becomes difficult to assess the importance of the remaining features.
Figure 4: Visual comparison of the attribution effects of LA and other competing algorithms on Inception-v3.
Figure 5: Visual comparison of the attribution effects of LA and other competing algorithms on MaxViT-T.

Theorems & Definitions (2)

theorem 1: Local Space
theorem 2: LA
