Post-hoc Part-prototype Networks
Andong Tan, Fengtao Zhou, Hao Chen
TL;DR
This work addresses the need for global interpretability in deep vision models by proposing a post-hoc part-prototype network that decomposes a trained classification head into $k$ interpretable prototypes, satisfying $\mathbf{v} = \sum_{i=1}^k \tilde{\mathbf{p}}_i$ and producing heatmaps $\mathbf{x} \tilde{\mathbf{p}}_i^T$. Prototypes are discovered via unsupervised NMF and refined through a residual-distribution process so that they exactly reconstruct the head while remaining interpretable. The optimization steps include $\min_{\mathbf{E},\mathbf{P}} \|\mathbf{F} - \mathbf{E P}\|_2^2$ and $\min_{\alpha_i} \|\mathbf{v} - \sum_{i=1}^k \alpha_i \mathbf{p}_i\|_2^2$, followed by Nelder–Mead refinement. Because the head is reconstructed exactly, the trained model's performance is preserved, and the approach yields more faithful explanations than prior methods, as demonstrated by explainability axioms and quantitative metrics across multiple backbones and large-scale datasets such as ImageNet. It enables scalable, post-hoc, global interpretability for complex models by linking specific prototypes to semantically meaningful object parts. The work highlights a practical path toward transparent AI systems without retraining, while acknowledging limitations tied to the pretrained head’s feature space.
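The decomposition summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `nmf` and `decompose_head`, the random stand-in data, and the multiplicative-update NMF solver are all assumptions made for illustration. In particular, the residual is split evenly across prototypes here as a simple placeholder, whereas the summary describes a Nelder–Mead refinement of that step, which this sketch omits. Features are assumed to be non-negative (e.g., post-ReLU).

```python
import numpy as np

def nmf(F, k, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative-update NMF: F (n x d, non-negative) ~= E (n x k) @ P (k x d)."""
    rng = np.random.default_rng(seed)
    n, d = F.shape
    E = rng.random((n, k)) + eps
    P = rng.random((k, d)) + eps
    for _ in range(n_iter):
        # Standard multiplicative updates; both factors stay non-negative.
        P *= (E.T @ F) / (E.T @ E @ P + eps)
        E *= (F @ P.T) / (E @ P @ P.T + eps)
    return E, P

def decompose_head(v, P):
    """Split head vector v (d,) into k prototypes whose sum equals v."""
    # Least-squares coefficients: min_alpha ||v - sum_i alpha_i p_i||^2
    alpha, *_ = np.linalg.lstsq(P.T, v, rcond=None)
    scaled = alpha[:, None] * P            # (k, d) scaled prototypes
    residual = v - scaled.sum(axis=0)      # part of v outside span(P)
    # ASSUMPTION: distribute the residual evenly; the source instead
    # refines this distribution (Nelder-Mead), which is not shown here.
    return scaled + residual / P.shape[0]

# Hypothetical stand-in data (not from the paper).
rng = np.random.default_rng(0)
F = rng.random((500, 64))      # non-negative features of one class
v = rng.normal(size=64)        # classification-head weights of that class
x = rng.random((49, 64))       # features of one image (e.g., 7x7 locations)

E, P = nmf(F, k=3)
P_tilde = decompose_head(v, P)
heatmaps = x @ P_tilde.T       # (49, 3): one heatmap column per prototype
```

Because the refined prototypes sum to $\mathbf{v}$ by construction, the per-prototype heatmaps add up to the original class response `x @ v` (up to floating-point error), which is why the decomposition leaves the model's predictions unchanged.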
Abstract
Post-hoc explainability methods such as Grad-CAM are popular because they do not affect the performance of a trained model. However, they mainly reveal "where" a model looks for a given input and fail to explain "what" the model looks for (e.g., what is important for classifying a bird image as a Scott Oriole?). Existing part-prototype networks leverage part-prototypes (e.g., the characteristic wing and head of a Scott Oriole) to answer both "where" and "what", but often underperform their black-box counterparts in accuracy. A natural question is therefore: can one construct a network that answers both "where" and "what" in a post-hoc manner, so that the model's performance is guaranteed? To this end, we propose the first post-hoc part-prototype network by decomposing the classification head of a trained model into a set of interpretable part-prototypes. Concretely, we propose an unsupervised prototype discovery and refinement strategy that obtains prototypes which precisely reconstruct the classification head while remaining interpretable. Besides guaranteeing performance, we show that our network offers more faithful explanations qualitatively and yields even better part-prototypes quantitatively than prior part-prototype networks.
