Beyond Output Faithfulness: Learning Attributions that Preserve Computational Pathways

Siyu Zhang, Kenneth Mcmillan

TL;DR

This work tackles the limitation that traditional faithfulness metrics (insertion/deletion) may yield externally faithful yet mechanistically misleading explanations. It introduces FEI, a framework that jointly optimizes external faithfulness via differentiable Ensemble Quantile Optimization and internal faithfulness via activation-preserving selective gradient clipping. Across CNNs and datasets, FEI achieves state-of-the-art external scores while maintaining strong activation consistency, demonstrating that explanations must respect both outcome alignment and the model's computational pathway. The approach offers practical, robust attribution maps and highlights the need to consider internal mechanistic integrity in interpretability research.

Abstract

Faithfulness metrics such as insertion and deletion evaluate how feature removal affects model outputs but overlook whether explanations preserve the computational pathway the network actually uses. We show that external metrics can be maximized through alternative pathways -- perturbations that reroute computation via different feature detectors while preserving output behavior. To address this, we propose activation preservation as a tractable proxy for preserving computational pathways. We introduce Faithfulness-guided Ensemble Interpretation (FEI), which jointly optimizes external faithfulness (via ensemble quantile optimization of insertion/deletion curves) and internal faithfulness (via selective gradient clipping). Across VGG and ResNet on ImageNet and CUB-200-2011, FEI achieves state-of-the-art insertion/deletion scores while maintaining significantly lower activation deviation, showing that both external and internal faithfulness are essential for reliable explanations.
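To make the external metric concrete, the following is a minimal sketch of a deletion curve and its area under the curve, not the paper's evaluation code. The function names (`deletion_curve`, `curve_auc`), the zero baseline, the step count, and the toy linear model are all assumptions chosen for illustration; the paper's exact protocol may differ.

```python
import numpy as np

def deletion_curve(model, x, attribution, baseline=0.0, steps=4):
    """Record the model score as the highest-attributed features are removed.

    `model` is any callable mapping an array shaped like `x` to a scalar score.
    Removal is simulated by overwriting features with `baseline` (an assumption).
    """
    order = np.argsort(-attribution.ravel())      # most important first
    x_cur = x.astype(float).ravel().copy()
    scores = [model(x_cur.reshape(x.shape))]      # score on the intact input
    chunk = int(np.ceil(order.size / steps))
    for start in range(0, order.size, chunk):
        x_cur[order[start:start + chunk]] = baseline
        scores.append(model(x_cur.reshape(x.shape)))
    return np.array(scores)

def curve_auc(scores):
    """Trapezoidal area under the curve, with the x-axis normalised to [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    return float(((scores[:-1] + scores[1:]) / 2).mean())

# Hypothetical toy model: a linear score over four features.
w = np.array([4.0, 3.0, 2.0, 1.0])
model = lambda x: float(w @ x.ravel())
x = np.ones(4)

good_attr = w * x          # ranks features by their true contribution
bad_attr = -good_attr      # ranks them in the opposite order

auc_good = curve_auc(deletion_curve(model, x, good_attr))
auc_bad = curve_auc(deletion_curve(model, x, bad_attr))
```

In this toy setting the attribution that matches the true contributions yields the lower deletion AUC, which is the behavior the deletion metric rewards. Note that the score depends only on the model output, so it says nothing about which internal activations produced it; that gap is what the abstract's internal faithfulness criterion targets.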

Paper Structure

This paper contains 40 sections, 28 equations, 11 figures, 6 tables, 1 algorithm.

Figures (11)

  • Figure 1: External metrics alone are insufficient. (b) FEI$_\text{NONE}$: direct optimization of insertion/deletion without constraints produces noisy artifacts despite the highest metric scores. (c) EP [fong2019understanding]: approximate optimization with smoothness regularization. (d) FEI$_\text{VM}$: precise optimization with internal pathway constraints yields coherent attributions while maintaining competitive external scores.
  • Figure 2: Internal faithfulness analysis. Quantitative metrics (left) and qualitative visualizations (right) jointly demonstrate that FEI preserves both internal structure and meaningful spatial patterns.
  • Figure 3: Visual comparison of attribution methods. Columns show FEI variants, Extremal Perturbation (EP), RISE, IIA, Lift-CAM, and FG-VCE. Constrained FEI variants yield more focused, object-aligned explanations compared to baselines, which often highlight background or diffuse regions.
  • Figure 4: Sanity check via cascading randomization. As layers are progressively randomized (left to right, top to bottom), attributions gradually lose object focus, indicating appropriate parameter sensitivity. Example: Sea Lion with VGG16 and FEI$_\text{IBM}$.
  • Figure 5: Ablation Study: Impact of Layer Clipping Ranges on Visual Explanations
  • ...and 6 more figures