Provably Better Explanations with Optimized Aggregation of Feature Attributions
Thomas Decker, Ananta R. Bhattarai, Jindong Gu, Volker Tresp, Florian Buettner
TL;DR
This work addresses the instability of, and disagreement among, feature attribution methods by proposing Optimized Aggregation, a convex weighting scheme that combines multiple attributions into a single, provably improved explanation $\phi^{\omega}$. Quality is measured in a generalized $L_2$-metric framework, $\mathcal{Q}(\phi(x)) = \mathbb{E}_{\gamma_1,\gamma_2}[\|\gamma_1 \phi(x) - \gamma_2\|_2^2]$, under which choosing the weights reduces to a constrained quadratic program that can be solved to global optimality and can target several metrics at once. The authors prove a decomposition showing that an aggregated explanation can outperform the correspondingly weighted individual explanations, and they give generalization bounds under which the improvement holds with high probability as the sample size grows. They instantiate three aggregation strategies, AGG_robust, AGG_faith, and AGG_opt, targeting robustness, faithfulness, or both. Empirically, across ImageNet models and diverse attribution methods, the optimized aggregations consistently outperform single-method baselines and standard mean/variance aggregations on robustness (SENS_AVG/SENS_MAX), faithfulness (INFD/FCOR), and related stability metrics; aggregating variations of a single method can also improve that method. The work suggests practical impact in producing more reliable explanations for opaque models and points to extensions such as voting schemes and broader integration with concept-based or counterfactual explanations.
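To make the reduction to a quadratic program concrete, here is a minimal Python sketch; it is not the authors' implementation. Substituting $\phi^{\omega} = \sum_i \omega_i \phi_i$ into $\mathcal{Q}$ gives $\omega^\top A \omega - 2 b^\top \omega + c$ with $A_{ij} = \mathbb{E}[\langle \gamma_1 \phi_i, \gamma_1 \phi_j \rangle]$, $b_i = \mathbb{E}[\langle \gamma_1 \phi_i, \gamma_2 \rangle]$ and $c = \mathbb{E}[\|\gamma_2\|_2^2]$. The sketch assumes that $\gamma_1$ acts as a matrix on the attribution vector and that samples of $(\gamma_1, \gamma_2)$ are available. The function names and the Frank-Wolfe solver are illustrative choices; any QP solver over the probability simplex could be used instead.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def quadratic_terms(attributions, samples):
    """Estimate A, b, c such that Q(w) = w^T A w - 2 b^T w + c.

    attributions: list of K attribution vectors phi_i (each of length d)
    samples: list of (gamma1, gamma2) pairs; gamma1 is an m x d matrix
             (list of rows), gamma2 a length-m vector (assumed shapes).
    """
    k, n = len(attributions), len(samples)
    A = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    c = 0.0
    for gamma1, gamma2 in samples:
        # z[i] = gamma1 @ phi_i
        z = [[dot(row, phi) for row in gamma1] for phi in attributions]
        c += dot(gamma2, gamma2) / n
        for i in range(k):
            b[i] += dot(z[i], gamma2) / n
            for j in range(k):
                A[i][j] += dot(z[i], z[j]) / n
    return A, b, c

def quality(w, A, b, c):
    """Empirical Q of the aggregation with weights w (lower is better)."""
    Aw = [dot(row, w) for row in A]
    return dot(w, Aw) - 2.0 * dot(b, w) + c

def optimize_weights(A, b, iters=2000):
    """Minimize Q over the probability simplex with Frank-Wolfe steps
    and exact line search (every iterate stays a convex combination)."""
    k = len(b)
    w = [1.0 / k] * k  # start from the plain mean aggregation
    for _ in range(iters):
        grad = [2.0 * (dot(A[i], w) - b[i]) for i in range(k)]
        s = min(range(k), key=lambda i: grad[i])  # best simplex vertex
        d = [-wi for wi in w]
        d[s] += 1.0  # direction e_s - w
        slope = dot(grad, d)
        if slope >= -1e-12:  # Frank-Wolfe gap is ~0: optimal
            break
        curv = dot(d, [dot(row, d) for row in A])
        step = 1.0 if curv <= 0.0 else min(1.0, -slope / (2.0 * curv))
        w = [wi + step * di for wi, di in zip(w, d)]
    return w

# Toy example (hypothetical data): gamma1 is the identity and gamma2 a
# fixed target, so Q is the squared distance to that target.
target = [1.0, 2.0]
attributions = [[2.0, 2.0], [0.0, 2.0], [1.0, 5.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
samples = [(identity, target)]

A, b, c = quadratic_terms(attributions, samples)
weights = optimize_weights(A, b)
aggregated_quality = quality(weights, A, b, c)
individual_qualities = [
    quality([float(i == j) for j in range(3)], A, b, c) for i in range(3)
]
```

In this toy setup the first two attributions err in opposite directions, so the exact optimum is $\omega = (0.5, 0.5, 0)$ with zero error, while each method on its own has an error of at least 1. The solver starts from the uniform mean, so its result is never worse than mean aggregation on the samples used for fitting.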
Abstract
Using feature attributions for post-hoc explanations is a common practice to understand and verify the predictions of opaque machine learning models. Despite the numerous techniques available, individual methods often produce inconsistent and unstable results, putting their overall reliability into question. In this work, we aim to systematically improve the quality of feature attributions by combining multiple explanations across distinct methods or their variations. For this purpose, we propose a novel approach to derive optimal convex combinations of feature attributions that yield provable improvements of desired quality criteria such as robustness or faithfulness to the model behavior. Through extensive experiments involving various model architectures and popular feature attribution techniques, we demonstrate that our combination strategy consistently outperforms individual methods and existing baselines.
