FACE: Faithful Automatic Concept Extraction
Dipkamal Bhusal, Michael Clifford, Sara Rampazzi, Nidhi Rastogi
TL;DR
FACE introduces a KL-divergence-regularized Non-negative Matrix Factorization (NMF) to learn concept representations that faithfully reflect a model’s downstream predictions. By supervising the factorization with the classifier’s output, FACE produces concept-based explanations that remain consistent with the original decision process, and it comes with theoretical guarantees that bound the deviation between the original and concept-based predictive distributions. Empirically, FACE outperforms prior methods on faithfulness and sparsity metrics across ImageNet, COCO, and CelebA, while maintaining competitive reconstruction. The approach advances practical interpretability by delivering semantically coherent yet behaviorally faithful concepts suitable for debugging and trust-building in vision models.
Abstract
Interpreting deep neural networks through concept-based explanations offers a bridge between low-level features and high-level human-understandable semantics. However, existing automatic concept discovery methods often fail to align the extracted concepts with the model's true decision-making process, thereby compromising explanation faithfulness. In this work, we propose FACE (Faithful Automatic Concept Extraction), a novel framework that augments Non-negative Matrix Factorization (NMF) with a Kullback-Leibler (KL) divergence regularization term to ensure alignment between the model's original and concept-based predictions. Unlike prior methods that operate solely on encoder activations, FACE incorporates classifier supervision during concept learning, enforcing predictive consistency and enabling faithful explanations. We provide theoretical guarantees showing that minimizing the KL divergence bounds the deviation in predictive distributions, thereby promoting faithful local linearity in the learned concept space. Systematic evaluations on the ImageNet, COCO, and CelebA datasets demonstrate that FACE outperforms existing methods across faithfulness and sparsity metrics.
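To make the idea of a KL-regularized factorization concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes non-negative activations A, a linear classifier head (C, b) standing in for the model's classifier, a simple sum of squared reconstruction error plus a weighted KL term as the objective, and projected gradient descent as the optimizer. The function names, the weight lam, the step size, and the synthetic data are all hypothetical choices made for illustration; the abstract does not specify these details.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with max-subtraction for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kl_rows(p, q, eps=1e-12):
    # KL(p_i || q_i) for each row i.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)

def kl_regularized_nmf(A, C, b, k=4, lam=1.0, lr=2e-3, steps=500, seed=0):
    """Factorize A ~= U @ W with U, W >= 0 while keeping the classifier's
    predictions on U @ W close to its predictions on A.

    Assumed objective (illustrative only):
        0.5 * ||A - U W||_F^2 + lam * sum_i KL(p_i || q_i)
    with p = softmax(A C + b) and q = softmax(U W C + b).
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    U = 0.5 * rng.random((n, k))   # per-sample concept coefficients
    W = 0.5 * rng.random((k, d))   # concept basis in activation space
    p = softmax(A @ C + b)         # original predictions (held fixed)
    history = []
    for _ in range(steps):
        R = U @ W                  # reconstructed activations
        q = softmax(R @ C + b)     # concept-based predictions
        loss = 0.5 * np.sum((A - R) ** 2) + lam * np.sum(kl_rows(p, q))
        history.append(loss)
        # Gradient w.r.t. R: reconstruction term plus KL term,
        # using d KL(p || softmax(z)) / dz = softmax(z) - p.
        G = (R - A) + lam * (q - p) @ C.T
        gU, gW = G @ W.T, U.T @ G
        # Gradient step, then projection onto the non-negative orthant.
        U = np.maximum(U - lr * gU, 0.0)
        W = np.maximum(W - lr * gW, 0.0)
    return U, W, history

# Synthetic stand-ins for encoder activations and a classifier head.
rng = np.random.default_rng(0)
A = np.abs(rng.normal(size=(60, 16)))
C = 0.3 * rng.normal(size=(16, 3))
b = np.zeros(3)
U, W, history = kl_regularized_nmf(A, C, b)
```

Setting lam to zero in this sketch reduces it to a plain projected-gradient NMF on the activations alone, which is the unsupervised setting the abstract contrasts FACE against.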
