The Susceptibility of Example-Based Explainability Methods to Class Outliers
Ikhtiyor Nematov, Dimitris Sacharidis, Tomer Sagi, Katja Hose
TL;DR
The paper examines how class outliers (high-loss, ambiguous training examples) affect local, example-based explanations for black-box models. It reformulates evaluation metrics around relevance, distinguishability, and correctness, and introduces a framework that treats outliers as potentially informative rather than only as instances to suppress. In experiments on the SMS Spam text dataset and a dog-vs-fish image task with four explainers (IF, RIF, DM, TraceIn) across explanation sizes $N \in \{2,5,10\}$, the study shows that outliers degrade relevance and distinguishability, and that suppression strategies can hurt correctness. The findings highlight the need for robust, outlier-aware explainability methods and for metrics that recognize the value of informative, ambiguous instances in model explanations.
Abstract
This study explores the impact of class outliers on the effectiveness of example-based explainability methods for black-box machine learning models. We reformulate existing explainability evaluation metrics, such as correctness and relevance, specifically for example-based methods, and introduce a new metric, distinguishability. Using these metrics, we highlight the shortcomings of current example-based explainability methods, including those that attempt to suppress class outliers. We conduct experiments on two datasets, a text classification dataset and an image classification dataset, and evaluate the performance of four state-of-the-art explainability methods. Our findings underscore the need for robust techniques to tackle the challenges posed by class outliers.
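To make the notion of a class outlier concrete, the sketch below flags the training examples with the highest loss, following the summary's characterization of outliers as high-loss, ambiguous instances. It is only an illustration under stated assumptions: the cross-entropy loss, the function names, the `top_fraction` cutoff, and the toy probabilities are all hypothetical and are not taken from the paper.

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability the model assigns to the true label."""
    return -math.log(max(probs[label], 1e-12))

def flag_class_outliers(pred_probs, labels, top_fraction=0.1):
    """Return (indices of the highest-loss examples, all per-example losses).

    top_fraction is an assumed, illustrative cutoff, not a value from the paper.
    """
    losses = [cross_entropy(p, y) for p, y in zip(pred_probs, labels)]
    k = max(1, int(round(top_fraction * len(losses))))
    # Rank training examples from highest to lowest loss.
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return ranked[:k], losses

# Toy predicted class probabilities for five training examples (made up).
pred_probs = [
    [0.95, 0.05],
    [0.10, 0.90],
    [0.20, 0.80],  # labelled 0, but the model prefers class 1
    [0.85, 0.15],
    [0.55, 0.45],
]
labels = [0, 1, 0, 0, 0]

outliers, losses = flag_class_outliers(pred_probs, labels, top_fraction=0.2)
print(outliers)
```

With these toy values, the third example (index 2) has by far the largest loss, so it is the one flagged. Whether such an instance should then be suppressed or kept as an informative explanation is exactly the question the paper raises.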
