InfoDisent: Explainability of Image Classification Models by Information Disentanglement
Łukasz Struski, Dawid Rymarczyk, Jacek Tabor
TL;DR
InfoDisent addresses a gap in explainable AI (XAI) by combining the flexibility of post-hoc explanations with the interpretability of prototypical concepts through information disentanglement. It introduces an architecture that freezes a pretrained backbone and learns, on top of it, an orthogonal channel transform, sparse pooling, and a nonnegative final classifier, producing atomic prototypical parts that can be visualized as image patches and heatmaps. The method yields both local and global explanations, captures positive as well as negative contributions, and extends prototypical explanations to large-scale datasets such as ImageNet. Experiments across several datasets, together with user studies, show competitive accuracy and strong human interpretability, with statistically significant improvements over baselines in user understanding and disambiguation. Overall, InfoDisent offers a scalable, backbone-agnostic XAI framework with practical implications for deploying interpretable models in high-stakes, real-world settings.
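The head described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `orthogonal_matrix` and `infodisent_head` are hypothetical, the orthogonal matrix is drawn at random rather than learned, spatial max pooling stands in for the paper's sparse pooling (whose exact form is not given here), nonnegativity is enforced by clipping the classifier weights, and training is omitted entirely.

```python
import numpy as np


def orthogonal_matrix(c, seed=0):
    # Hypothetical stand-in for a learned transform: the Q factor of a QR
    # decomposition of a random Gaussian matrix is orthogonal (Q.T @ Q == I).
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((c, c)))
    # Flip column signs so the result is deterministic for a given seed.
    return q * np.sign(np.diag(r))


def infodisent_head(features, U, W):
    """Sketch of an InfoDisent-style head on frozen backbone features.

    features: (C, H, width) feature map from a frozen backbone.
    U:        (C, C) orthogonal channel transform.
    W:        (K, C) classifier weights; negatives are clipped to zero.
    """
    C, H, width = features.shape
    # Rotate the channel vector at every spatial position.
    rotated = np.einsum("dc,chw->dhw", U, features)
    # Assumed pooling: keep only the strongest spatial response per channel.
    pooled = rotated.reshape(C, H * width).max(axis=1)
    # Nonnegative classifier (assumption: enforced by clipping).
    W_pos = np.clip(W, 0.0, None)
    # Per-class, per-channel contributions; a channel whose pooled value is
    # negative contributes negatively even though all weights are >= 0.
    contributions = W_pos * pooled[None, :]
    logits = contributions.sum(axis=1)
    return logits, contributions, rotated


# Toy usage: 8 "backbone" channels on a 7x7 grid, 3 classes.
rng = np.random.default_rng(1)
feats = rng.standard_normal((8, 7, 7))
U = orthogonal_matrix(8)
W = rng.standard_normal((3, 8))
logits, contrib, rotated = infodisent_head(feats, U, W)
print(logits.shape)  # (3,)
```

Because `U` is orthogonal, it only rotates each position's channel vector, so the transform itself discards none of the backbone's information. Each channel slice `rotated[d]` is a spatial map that could be upsampled into a heatmap for one candidate part, and `contrib` shows how each class logit decomposes into per-part terms that may be positive or negative.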
Abstract
In this work, we introduce InfoDisent, a hybrid approach to explainability based on the information bottleneck principle. InfoDisent disentangles the information in the final layer of any pretrained model into atomic concepts, which can be interpreted as prototypical parts. This approach merges the flexibility of post-hoc methods with the concept-level modeling capabilities of self-explainable neural networks such as ProtoPNets. We demonstrate the effectiveness of InfoDisent through computational experiments and user studies across various datasets, using modern backbones such as ViTs and convolutional networks. Notably, InfoDisent extends the prototypical-parts approach to new, large-scale domains such as ImageNet.
