Provably Better Explanations with Optimized Aggregation of Feature Attributions

Thomas Decker; Ananta R. Bhattarai; Jindong Gu; Volker Tresp; Florian Buettner

TL;DR

This work addresses the instability and disagreement among feature attribution methods by proposing Optimized Aggregation, a convex-weighting scheme that combines multiple attributions into a single, provably improved explanation $\phi^{\omega} = \sum_i \omega_i \phi^i$. It formulates the problem under a generalized $L_2$-metric framework, $\mathcal{Q}(\phi(x)) = \mathbb{E}_{\gamma_1,\gamma_2}[\|\gamma_1 \phi(x) - \gamma_2\|_2^2]$, and reduces weight optimization to a constrained quadratic program, which yields globally optimal weights and allows several metrics to be improved simultaneously. The authors prove a decomposition showing that the aggregated explanation is at least as good as, and can outperform, the weighted individual attributions, and they provide generalization bounds ensuring that the improvements hold with high probability as the sample size grows. They also instantiate three aggregation strategies, AGG_robust, AGG_faith, and AGG_opt, targeting robustness, faithfulness, or both. Empirically, across ImageNet models and diverse attribution methods, the optimized aggregations consistently outperform single-method baselines and standard mean/variance aggregations, improving robustness (SENS_AVG/SENS_MAX), faithfulness (INFD/FCOR), and related stability metrics, while also enhancing some individual methods via aggregation. The work suggests practical impact in producing more reliable explanations for opaque models and points to extensions such as voting schemes and broader integration with concept-based or counterfactual explanations.
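To make the quadratic-program step concrete, here is a minimal sketch of how convex weights could be estimated from Monte Carlo samples of $(\gamma_1, \gamma_2)$. It is an illustration under stated assumptions, not the authors' implementation: the function names (`quadratic_terms`, `quality`, `optimal_convex_weights`), the array shapes, the toy data, and the choice of SciPy's SLSQP solver are all hypothetical. The sketch relies on the fact that $\mathcal{Q}(\phi^{\omega}(x))$ is quadratic in $\omega$, so the estimate can be written as $\omega^\top H \omega - 2 b^\top \omega + c$ and minimized over the probability simplex.

```python
import numpy as np
from scipy.optimize import minimize


def quadratic_terms(Phi, gamma1, gamma2):
    """Estimate H, b, c such that Q(w) ~= w^T H w - 2 b^T w + c.

    Phi:    (d, m) matrix whose columns are the m attributions phi^i(x).
    gamma1: (K, p, d) samples of the random matrix gamma_1 (assumed shape).
    gamma2: (K, p) samples of the random vector gamma_2 (assumed shape).
    """
    A = gamma1 @ Phi                            # (K, p, m): gamma_1 applied to each attribution
    K = A.shape[0]
    H = np.einsum("kpi,kpj->ij", A, A) / K      # sample mean of A_k^T A_k
    b = np.einsum("kpi,kp->i", A, gamma2) / K   # sample mean of A_k^T gamma_2
    c = np.sum(gamma2 ** 2) / K                 # sample mean of ||gamma_2||^2
    return H, b, c


def quality(w, H, b, c):
    """Estimated metric of the aggregated attribution (lower is better)."""
    return float(w @ H @ w - 2.0 * b @ w + c)


def optimal_convex_weights(H, b):
    """Minimize the quadratic over {w >= 0, sum(w) = 1} with SLSQP."""
    m = H.shape[0]
    result = minimize(
        lambda w: w @ H @ w - 2.0 * b @ w,
        x0=np.full(m, 1.0 / m),                 # start from the plain mean
        jac=lambda w: 2.0 * (H @ w - b),        # valid because H is symmetric
        method="SLSQP",
        bounds=[(0.0, 1.0)] * m,
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    w = np.clip(result.x, 0.0, None)            # remove tiny negative round-off
    return w / w.sum()


# Toy data (purely illustrative): four noisy copies of a hidden target vector.
rng = np.random.default_rng(0)
d, m, K = 20, 4, 200
target = rng.normal(size=d)
Phi = target[:, None] + rng.normal(size=(d, m))
gamma1 = np.eye(d) + 0.1 * rng.normal(size=(K, d, d))
gamma2 = target + 0.1 * rng.normal(size=(K, d))

H, b, c = quadratic_terms(Phi, gamma1, gamma2)
w = optimal_convex_weights(H, b)
print("weights:", np.round(w, 3))
print("aggregated metric:", quality(w, H, b, c))
print("individual metrics:", [quality(e, H, b, c) for e in np.eye(m)])
```

Because every single-method explanation corresponds to a vertex of the simplex, the minimizer of this convex problem can never have a worse estimated metric than the best individual method on the same samples.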

Abstract

Using feature attributions for post-hoc explanations is a common practice to understand and verify the predictions of opaque machine learning models. Despite the numerous techniques available, individual methods often produce inconsistent and unstable results, putting their overall reliability into question. In this work, we aim to systematically improve the quality of feature attributions by combining multiple explanations across distinct methods or their variations. For this purpose, we propose a novel approach to derive optimal convex combinations of feature attributions that yield provable improvements of desired quality criteria such as robustness or faithfulness to the model behavior. Through extensive experiments involving various model architectures and popular feature attribution techniques, we demonstrate that our combination strategy consistently outperforms individual methods and existing baselines.

Paper Structure (46 sections, 7 theorems, 35 equations, 5 figures, 11 tables)

This paper contains 46 sections, 7 theorems, 35 equations, 5 figures, 11 tables.

Introduction
Problem Setup
Background and Related Work
Measuring attribution quality
Robustness
Faithfulness
Other metrics
Aggregating explanations
Optimizing Explanations with Aggregation
Generalized L2 metrics for explanations
Deriving optimal weights
Provable improvement through aggregation
Generalization bounds for estimated weights
Optimal aggregation for desired improvements
Experiments
...and 31 more sections

Key Result

Theorem 4.2

Let $\phi^{\omega}=\sum_i \omega_i \phi^i$ be the aggregated explanation with convex weights $\omega_i$. Then the quality metric of $\phi^{\omega}$ is always at least as good as the weighted average of the quality metrics of the individual attributions.
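
The display that accompanies this statement is not reproduced here. As a hedged sketch, the inequality it describes can be derived from the generalized $L_2$ metric $\mathcal{Q}(\phi(x)) = \mathbb{E}_{\gamma_1,\gamma_2}[\|\gamma_1 \phi(x) - \gamma_2\|_2^2]$ quoted in the TL;DR, assuming convex weights ($\omega_i \ge 0$, $\sum_i \omega_i = 1$) and that smaller values of $\mathcal{Q}$ are better; the paper's exact formulation may differ. Writing $a_i = \gamma_1 \phi^i(x) - \gamma_2$, the weighted mean is $\bar{a} = \sum_i \omega_i a_i = \gamma_1 \phi^{\omega}(x) - \gamma_2$, and the standard identity for convex combinations gives:

$$
\begin{aligned}
\sum_i \omega_i \|a_i\|_2^2 &= \|\bar{a}\|_2^2 + \sum_i \omega_i \|a_i - \bar{a}\|_2^2, \qquad a_i - \bar{a} = \gamma_1\big(\phi^i(x) - \phi^{\omega}(x)\big), \\
\mathcal{Q}(\phi^{\omega}(x)) &= \sum_i \omega_i\, \mathcal{Q}(\phi^i(x)) - \sum_i \omega_i\, \mathbb{E}_{\gamma_1}\big[\|\gamma_1(\phi^i(x) - \phi^{\omega}(x))\|_2^2\big] \;\le\; \sum_i \omega_i\, \mathcal{Q}(\phi^i(x)).
\end{aligned}
$$

The subtracted term is non-negative and grows with the disagreement among the individual attributions, so under these assumptions aggregation helps most when the methods disagree.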

Figures (5)

Figure 1: Disagreement across attribution methods (left): Different feature attribution methods $(\phi^1, \dots , \phi^5)$ provide distinct perspectives about which particular features of an input $x$ are important for an opaque model prediction $f(x)$. They often disagree, causing ambiguity about which inputs truly matter. Our Optimized Aggregation approach (right): We study how to combine all individual attribution results fruitfully to attain better explanations. We propose a novel aggregation approach to retrieve optimal convex weights $\omega_i$ such that the aggregated feature attribution $\phi^{\omega} = \sum_i \omega_i \phi^i$ is provably more robust and more faithful to the underlying model.
Figure 2: Individual outcomes of different feature attribution methods as well as our approach $\mathbf{\text{AGG}_{\textit{opt}}}$ (right column) for seven images based on VGG16 (rows 1-5) and DeiT (rows 6-8). In addition to the quantitative improvements established in section 5.1 for robustness and faithfulness, our aggregation strategy also produces visually more intuitive and convincing explanations. It succeeds in enhancing the attribution results by combining several valid perspectives to complement each other (e.g., rows 1 and 2) and by automatically discarding seemingly deteriorated explanations (e.g., rows 7 and 8).
Figure 3: Individual attribution results of different LIME variants varying by superpixel structure and sparsity regularization on VGG16. The object to be classified is rather small and $\text{AGG}_{\textit{opt}}$ automatically combines only the sparsest explanations to enhance the explanation.
Figure 4: Boxplots of aggregation weights obtained by $\text{AGG}_{\textit{opt}}$ for the two considered sets of attribution methods during the evaluations in section 5.1 for robustness (top) and faithfulness (bottom) based on 500 samples. For each method, the allocated weight differs substantially among samples as most distributions cover almost the entire range between 0 and 1. There is also high variability across models indicating that a single method alone is unable to provide a reliable explanation for every prediction consistently.
Figure 5: Average aggregation weights obtained by $\text{AGG}_{\textit{opt}}$ while optimizing the results from different versions of LIME on VGG16 (top) and ViT (bottom) including 95% confidence intervals as error bars. For smaller objects, significantly more weight is put on higher sparsity regularization.

Theorems & Definitions (12)

Definition 4.1
Theorem 4.2
Theorem 4.3
Theorem 1.1
proof
Theorem 1.2: 26.5.3 in Shalev-Shwartz and Ben-David (2014)
Lemma 1.3
proof
Lemma 1.4
proof
...and 2 more
