Model Agnostic Contrastive Explanations for Structured Data
Amit Dhurandhar, Tejaswini Pedapati, Avinash Balakrishnan, Pin-Yu Chen, Karthikeyan Shanmugam, Ruchir Puri
TL;DR
This paper introduces MACEM, a model-agnostic method that generates contrastive explanations for structured data by querying only a classifier's class probabilities. It defines Pertinent Positives and Pertinent Negatives as the sparsest and closest perturbations relative to base values, and computes them with projected FISTA in a black-box setting, using zeroth-order gradient estimation. The approach handles real and categorical features through two strategies (FMA and SSA). Across five public datasets, the authors report explanations that are more faithful than those produced by LIME, supported by qualitative expert assessments. Because MACEM explains a prediction through minimal changes to the input, it is well suited to regulatory and domain-specific explainability needs. The work also outlines directions for extending the method to unstructured data and more complex modalities.
Abstract
Recently, a method [7] was proposed to generate contrastive explanations for differentiable models such as deep neural networks, where one has complete access to the model. In this work, we propose the Model Agnostic Contrastive Explanations Method (MACEM), which generates contrastive explanations for any classification model where one can query only the class probabilities for a desired input. This allows us to generate contrastive explanations not only for neural networks, but also for models such as random forests, boosted trees, and even arbitrary ensembles, which are still among the state of the art when learning on structured data [13]. Moreover, to obtain meaningful explanations, we propose a principled approach to handling real and categorical features, leading to novel formulations for computing the pertinent positives and pertinent negatives that form the essence of a contrastive explanation. A detailed treatment of these data types was not performed in the previous work, which assumed all features to be positive and real-valued, with zero indicating the least interesting value. We depart from this strong implicit assumption and generalize these methods so that they apply across a much wider range of problem settings. We validate our approach quantitatively and qualitatively on five public datasets covering diverse domains.
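To make the query-only setting concrete, the sketch below estimates the gradient of a black-box probability function from function evaluations alone, which is the kind of zeroth-order estimate that lets a gradient-based solver such as projected FISTA run without model internals. It is an illustrative sketch, not the authors' implementation: the random-direction estimator, the names `zeroth_order_gradient` and `toy_predict_proba`, and the parameters `num_directions` and `beta` are all assumptions made for this example, and the toy logistic model simply stands in for any classifier that returns a class probability.

```python
import numpy as np


def zeroth_order_gradient(f, x, num_directions=50, beta=1e-3, seed=0):
    """Estimate the gradient of f at x using only evaluations of f.

    Assumed estimator (not taken from the paper): average forward
    differences along random unit directions, rescaled by the dimension.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    fx = f(x)  # one query at the unperturbed point
    grad = np.zeros(d)
    for _ in range(num_directions):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)  # uniform direction on the unit sphere
        # Forward difference approximates the directional derivative along u.
        grad += (f(x + beta * u) - fx) / beta * u
    # For uniform unit directions E[u u^T] = I / d, hence the factor d.
    return d * grad / num_directions


# Hypothetical black box: we may only call it, never inspect its weights.
_W = np.array([1.0, -2.0, 0.5])


def toy_predict_proba(x):
    """Toy stand-in for a classifier's probability of the positive class."""
    return 1.0 / (1.0 + np.exp(-_W @ x))


x0 = np.array([0.2, -0.1, 0.4])
g_est = zeroth_order_gradient(toy_predict_proba, x0, num_directions=5000)

# Analytic gradient of the toy model, used only to check the estimate.
p0 = toy_predict_proba(x0)
g_true = p0 * (1.0 - p0) * _W
rel_err = np.linalg.norm(g_est - g_true) / np.linalg.norm(g_true)
```

The estimate is noisy: its error shrinks roughly with the square root of `num_directions`, so each gradient step costs many model queries. That trade-off between query budget and gradient accuracy is inherent to the black-box setting.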
