Distribution-Based Feature Attribution for Explaining the Predictions of Any Classifier
Xinpeng Li, Kai Ming Ting
TL;DR
This work formalizes feature attribution as explanations supported by the underlying data distribution and introduces DFAX, a model-agnostic method that explains classifier predictions directly from data density, using one-dimensional kernel density estimation (KDE) in feature subspaces. By estimating class-conditioned densities and computing the difference between the density of the target class and that of the other classes, DFAX provides distribution-consistent attributions that draw on the full dataset X. Extensive quantitative and qualitative experiments across ten diverse datasets show that DFAX outperforms state-of-the-art baselines in deletion/insertion fidelity and produces more interpretable explanations, while also running substantially faster. The approach decouples attribution from classifier queries and demonstrates practical impact for trustworthy, distribution-grounded explanations in real-world applications.
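The density-difference idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Gaussian kernel, the normal-reference bandwidth rule, and the choice to pool all non-target classes into a single group are assumptions made here for illustration, since this summary does not specify those details. Each feature is scored by the estimated 1D density of its value under the target class minus the density under the remaining classes, and no classifier is queried.

```python
import math


def gaussian_kde_1d(samples, x, bandwidth=None):
    """Estimate a 1D density at x from samples using a Gaussian kernel."""
    n = len(samples)
    if n == 0:
        return 0.0  # assumption: an empty group contributes zero density
    if bandwidth is None:
        mean = sum(samples) / n
        var = sum((s - mean) ** 2 for s in samples) / max(n - 1, 1)
        std = math.sqrt(var)
        # Normal-reference rule of thumb (an assumed default);
        # fall back to 1.0 when the feature is constant.
        bandwidth = 1.06 * std * n ** (-1 / 5) if std > 0 else 1.0
    norm = n * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm


def density_difference_attribution(X, y, x, target):
    """Hypothetical per-feature score: target-class density minus rest density."""
    scores = []
    for j in range(len(x)):
        in_class = [row[j] for row, label in zip(X, y) if label == target]
        others = [row[j] for row, label in zip(X, y) if label != target]
        scores.append(gaussian_kde_1d(in_class, x[j]) - gaussian_kde_1d(others, x[j]))
    return scores


# Toy data: feature 0 separates the two classes, feature 1 does not.
X = [[0.0, 5.0], [0.2, 5.1], [0.1, 4.9],
     [3.0, 5.0], [3.2, 5.1], [3.1, 4.9]]
y = [0, 0, 0, 1, 1, 1]

scores = density_difference_attribution(X, y, x=[0.1, 5.0], target=0)
print(scores)
```

In this toy example the first feature receives a positive score, because its value is typical of class 0 and atypical of class 1, while the second feature, which is distributed identically in both classes, receives a score of zero.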
Abstract
The proliferation of complex, black-box AI models has intensified the need for techniques that can explain their decisions. Feature attribution methods have become a popular solution for providing post-hoc explanations, yet the field has historically lacked a formal problem definition. This paper addresses this gap by introducing a formal definition for the problem of feature attribution, which stipulates that explanations be supported by an underlying probability distribution represented by the given dataset. Our analysis reveals that many existing model-agnostic methods fail to meet this criterion, while even those that do often possess other limitations. To overcome these challenges, we propose Distributional Feature Attribution eXplanations (DFAX), a novel, model-agnostic method for feature attribution. DFAX is the first feature attribution method to explain classifier predictions directly based on the data distribution. We show through extensive experiments that DFAX is more effective and efficient than state-of-the-art baselines.
