FaCT: Faithful Concept Traces for Explaining Neural Network Decisions

Amin Parchami-Araghi, Sukrut Rao, Jonas Fischer, Bernt Schiele

TL;DR

FaCT tackles the challenge of explaining neural network decisions at the concept level by building faithful concept representations directly into the forward pass, using B-cos transforms and bias-free Sparse Autoencoders. Each logit can be decomposed into contributions from interpretable concepts, every concept has an input-grounded visualization, and concepts are shared across classes and traceable across layers. To evaluate concept quality without human priors, the authors introduce the C$^2$-score, a foundation-model-based consistency metric that correlates with human interpretability. Empirically, FaCT yields diverse, shared concepts across CNNs and ViTs, retains competitive ImageNet accuracy, and shows higher concept consistency and interpretability than prior methods, offering a framework for faithful concept-based explanations.

Abstract

Deep networks have shown remarkable performance across a wide range of tasks, yet getting a global concept-level understanding of how they function remains a key challenge. Many post-hoc concept-based approaches have been introduced to understand their workings, yet they are not always faithful to the model. Further, they make restrictive assumptions on the concepts a model learns, such as class-specificity, small spatial extent, or alignment to human expectations. In this work, we put emphasis on the faithfulness of such concept-based explanations and propose a new model with model-inherent mechanistic concept-explanations. Our concepts are shared across classes and, from any layer, their contribution to the logit and their input-visualization can be faithfully traced. We also leverage foundation models to propose a new concept-consistency metric, C$^2$-Score, that can be used to evaluate concept-based methods. We show that, compared to prior work, our concepts are quantitatively more consistent and users find our concepts to be more interpretable, all while retaining competitive ImageNet performance.
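
The claim that a concept's contribution to a logit can be faithfully traced can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a plain bias-free linear map from concept activations to logits (standing in for the model's actual layers), and the names `concept_contributions`, `weights`, and `activations` are hypothetical. It only shows why, without a bias term, per-concept contributions sum exactly to the logit.

```python
import numpy as np


def concept_contributions(weights, activations):
    """Per-concept contribution to every logit for a bias-free linear head.

    weights:     (num_classes, num_concepts) array (hypothetical head weights)
    activations: (num_concepts,) array of concept activations
    Returns a (num_classes, num_concepts) array of contributions.
    """
    return weights * activations[np.newaxis, :]


# Toy data; sizes and values are illustrative assumptions only.
rng = np.random.default_rng(0)
num_classes, num_concepts = 3, 5
weights = rng.normal(size=(num_classes, num_concepts))
activations = np.abs(rng.normal(size=num_concepts))  # non-negative, like sparse codes

contributions = concept_contributions(weights, activations)
logits = weights @ activations  # no bias term

# With no bias, each logit equals the sum of its concept contributions.
assert np.allclose(contributions.sum(axis=1), logits)

# Relative share of each concept in each logit (rows sum to 1).
shares = contributions / logits[:, np.newaxis]
```

Adding a bias term would break the equality, since part of each logit would then be attributable to no concept. This is one way to see why the bias-free design matters for statements such as "a concept contributes a given percentage of a logit".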

Paper Structure

This paper contains 32 sections, 23 equations, 23 figures, 2 tables.

Figures (23)

  • Figure 1: It All Adds Up: Our proposed model FaCT offers a faithful concept-decomposition across layers with a shared basis across classes, e.g., the late-layer 'wheel' concept or early-layer 'yellow' concept are shared across classes and used by the model. Further, every concept is faithfully visualized at input-level (Concept Activation = $\sum$ Pixel Contributions) and every logit is faithfully explained at concept-level (Logit = $\sum$ Concept Contributions), e.g., the yellow-color concept contributes 4.3% of the School Bus logit. Contributions between different concept layers can also be faithfully computed.
  • Figure 2: Overview of FaCT.
  • Figure 3: Beyond Annotations: We observe that PartImageNet annotations (He et al., 2022) fail to capture our concepts, either by not having them annotated ('ship mast', top-right) or by not matching the granularity ('dog blaze', bottom-right). Our proposed C$^2$-score tackles this by considering concept attributions together with DINOv2 features, leading to a class-agnostic evaluation framework. See also the supplementary material.
  • Figure 4: Evaluating Concept Consistency: We evaluate the C$^2$-score (cf. the method section) for both FaCT's concepts and those of prior work. (left) We plot the percentage of concepts for different consistency ranges, finding our concepts to be more consistent than those of prior work. (right) We randomly sample concepts from different ranges of the C$^2$-score to demonstrate its effectiveness. Notice that the C$^2$-score correctly assigns high consistency to the 'helmet' or 'muzzle' concepts, despite them being shared across classes. See also the supplementary material.
  • Figure 5: FaCT for Diverse Concepts. (left) We observe significant gains in concept consistency for FaCT compared to B-cos channels. This holds across architectures (columns) and layers (points) with competitive performance on ImageNet (largest drop < 3%); see the supplementary material for further analysis and comparison to standard models. (right) We observe high diversity for our concepts in terms of spatial extent (top) and show samples at the bottom. See also the supplementary material.
  • ...and 18 more figures