COMIX: Compositional Explanations using Prototypes
Sarath Sivaprasad, Dmitry Kangin, Plamen Angelov, Mario Fritz
TL;DR
COMIX addresses the interpretability gap with by-design explanations that faithfully reflect the model's decision process. It builds a B-Cos-based encoder to obtain interpretable embeddings, selects a small set of class-defining features using mutual information and pseudo-labels, and retrieves matching prototypical regions from the training data, so that each prediction is justified by per-feature prototype explanations combined through majority voting. The approach yields high fidelity and sparsity, reports strong interpretability metrics (including a 48.82% improvement in C-insertion on ImageNet over the best baseline), and achieves competitive accuracy across diverse datasets, with potential for zero-shot generalization. Because test decisions are linked directly to training data, COMIX supports both factual and counterfactual interpretations and opens avenues for segmentation and safety-critical deployments. Reproducibility is supported by detailed appendices and a forthcoming code release.
Abstract
Aligning machine representations with human understanding is key to improving the interpretability of machine learning (ML) models. When classifying a new image, humans often explain their decisions by decomposing the image into concepts and pointing to corresponding regions in familiar images. Current ML explanation techniques typically trace decision-making processes to reference prototypes, generate attribution maps highlighting feature importance, or incorporate intermediate bottlenecks designed to align with human-interpretable concepts. The proposed method, named COMIX, classifies an image by decomposing it into regions based on learned concepts and tracing each region to corresponding regions in images from the training dataset, ensuring that explanations fully represent the actual decision-making process. We dissect the test image into selected internal representations of a neural network to derive prototypical parts (primitives) and match them with the corresponding primitives derived from the training data. We theoretically prove, and demonstrate in a series of qualitative and quantitative experiments, that our method, in contrast to post hoc analysis, provides faithful explanations, and we show that its efficiency is competitive with that of other inherently interpretable architectures. Notably, it shows substantial improvements in fidelity and sparsity metrics, including a 48.82% improvement in the C-insertion score on the ImageNet dataset over the best state-of-the-art baseline.
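The decision rule described above (select a few class-defining features, match each one to a training example, then vote) can be illustrated with a toy sketch. This is not the authors' implementation and makes several assumptions: random vectors stand in for the B-Cos encoder embeddings, true labels stand in for pseudo-labels, mutual information is estimated with a simple histogram, and the "prototype" for a feature is taken to be the training sample whose value on that feature is closest to the test value. All function names are hypothetical.

```python
import numpy as np
from collections import Counter


def mutual_information(feature, labels, bins=8):
    """Histogram estimate of I(feature; label) in nats (illustrative estimator)."""
    edges = np.histogram_bin_edges(feature, bins=bins)
    binned = np.digitize(feature, edges[1:-1])  # bin ids in 0..bins-1
    classes = np.unique(labels)
    joint = np.zeros((bins, classes.size))
    for j, c in enumerate(classes):
        joint[:, j] = np.bincount(binned[labels == c], minlength=bins)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginal over feature bins
    py = joint.sum(axis=0, keepdims=True)  # marginal over classes
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())


def select_features(train_emb, train_labels, k):
    """Return indices of the k features with the highest estimated MI."""
    scores = np.array([
        mutual_information(train_emb[:, d], train_labels)
        for d in range(train_emb.shape[1])
    ])
    return np.argsort(scores)[::-1][:k]


def classify_with_prototypes(test_emb, train_emb, train_labels, selected):
    """Match each selected feature to its nearest training sample and vote.

    Assumption: nearness is the absolute difference on that single feature.
    Returns the predicted label and one (feature, train_index, label)
    triple per selected feature, which serves as the explanation.
    """
    explanation = []
    for d in selected:
        idx = int(np.argmin(np.abs(train_emb[:, d] - test_emb[d])))
        explanation.append((int(d), idx, int(train_labels[idx])))
    votes = Counter(label for _, _, label in explanation)
    prediction = votes.most_common(1)[0][0]
    return prediction, explanation


# Toy data: only the first three of ten features carry class information.
rng = np.random.default_rng(0)
n, dim = 200, 10
train_labels = rng.integers(0, 2, size=n)
train_emb = rng.normal(size=(n, dim))
train_emb[:, :3] += 6.0 * train_labels[:, None]

selected = select_features(train_emb, train_labels, k=3)

test_emb = np.zeros(dim)
test_emb[:3] = 6.0  # resembles class 1 on the informative features
prediction, explanation = classify_with_prototypes(
    test_emb, train_emb, train_labels, selected
)
print(prediction, explanation)
```

In this sketch the explanation is the decision: the predicted label is nothing more than the majority of the labels of the retrieved training samples, so each vote can be traced to a specific training example and a specific feature.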
