Improving Network Interpretability via Explanation Consistency Evaluation

Hefeng Wu, Hao Jiang, Keze Wang, Ziyi Tang, Xianghuan He, Liang Lin

TL;DR

This work tackles the interpretability–performance trade-off in deep nets by introducing explanation-consistency learning, where training samples are reweighted according to how stably their heatmaps and predictions survive semantic-preserved adversarial perturbations. The framework defines $E(x_i, \hat{x}_i)$ to quantify explanation robustness and uses $v_i=1-E(x_i, \hat{x}_i)$ to bias learning toward hard explanations, formalized through Loss = $\sum_i v_i L_i(x_i,y_i)$. An iterative pipeline trains the model, assesses explanation consistency via semantic-preserved attacks, and updates sample weights to improve both accuracy and heatmap quality without extra supervision. Empirical results across STL-10, VOC, CUB-200-2011, and ImageNet-9 show consistent gains in recognition and interpretability for regular and interpretable networks, along with debiasing and robustness benefits. The approach also offers flexible extensions to multi-label tasks, various architectures, and potential integration with vision-language models.
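The reweighting rule in the summary above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `sample_weights` and `reweighted_loss` are hypothetical, the numeric values are made up, and clamping $E$ to $[0, 1]$ is an assumption about its range.

```python
from typing import Sequence


def sample_weights(consistency: Sequence[float]) -> list[float]:
    """Compute v_i = 1 - E(x_i, x_hat_i) for each sample.

    Assumption: E lies in [0, 1]; values outside that range are clamped.
    """
    return [1.0 - min(max(e, 0.0), 1.0) for e in consistency]


def reweighted_loss(per_sample_losses: Sequence[float],
                    weights: Sequence[float]) -> float:
    """Compute Loss = sum_i v_i * L_i(x_i, y_i)."""
    if len(per_sample_losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    return sum(v * loss for v, loss in zip(weights, per_sample_losses))


# Hypothetical values for three training samples.
consistency = [0.9, 0.5, 0.1]  # E(x_i, x_hat_i): high means a stable explanation
losses = [0.2, 0.7, 1.5]       # L_i(x_i, y_i): per-sample classification loss
weights = sample_weights(consistency)
total = reweighted_loss(losses, weights)
```

In this sketch, the sample with the lowest consistency receives the largest weight, so it contributes most to the total loss, which matches the stated goal of biasing learning toward samples with hard explanations.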

Abstract

While deep neural networks have achieved remarkable performance, they tend to lack transparency in prediction. The pursuit of greater interpretability in neural networks often results in a degradation of their original performance. Some works strive to improve both interpretability and performance, but they primarily depend on meticulously imposed conditions. In this paper, we propose a simple yet effective framework that acquires more explainable activation heatmaps and simultaneously increases the model performance, without the need for any extra supervision. Specifically, our concise framework introduces a new metric, i.e., explanation consistency, to reweight the training samples adaptively in model learning. The explanation consistency metric measures the similarity between the model's visual explanations of the original samples and those of semantic-preserved adversarial samples, whose background regions are perturbed using image adversarial attack techniques. Our framework then promotes model learning by paying closer attention to those training samples with a high difference in explanations (i.e., low explanation consistency), for which the current model cannot provide robust interpretations. Comprehensive experimental results on various benchmarks demonstrate the superiority of our framework in multiple aspects, including higher recognition accuracy, greater data debiasing capability, stronger network robustness, and more precise localization ability on both regular networks and interpretable networks. We also provide extensive ablation studies and qualitative analyses to unveil the detailed contribution of each component.
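The two ingredients named in the abstract, a perturbation confined to background regions and a similarity score between two explanation heatmaps, can be illustrated with a small sketch. Everything here is an assumption for illustration: the helper names are hypothetical, a fixed additive offset stands in for a real adversarial attack, a binary foreground mask is assumed to be available, and cosine similarity is a stand-in rather than the paper's definition of explanation consistency.

```python
import math
from typing import Sequence

Grid = Sequence[Sequence[float]]


def perturb_background(image: Grid, noise: Grid,
                       foreground_mask: Grid) -> list[list[float]]:
    """Add noise only where the mask is 0, leaving foreground pixels untouched.

    Assumption: the mask is 1 on the object (foreground) and 0 elsewhere.
    """
    return [
        [px + (1.0 - m) * n for px, n, m in zip(img_row, noise_row, mask_row)]
        for img_row, noise_row, mask_row in zip(image, noise, foreground_mask)
    ]


def heatmap_similarity(a: Grid, b: Grid) -> float:
    """Cosine similarity between two flattened heatmaps (0.0 if either is all zeros)."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm = math.sqrt(sum(x * x for x in flat_a)) * math.sqrt(sum(y * y for y in flat_b))
    return dot / norm if norm > 0 else 0.0


# Hypothetical 2x2 image whose top-left pixel is the foreground object.
image = [[0.2, 0.4], [0.6, 0.8]]
mask = [[1.0, 0.0], [0.0, 0.0]]
noise = [[0.1, 0.1], [0.1, 0.1]]
perturbed = perturb_background(image, noise, mask)

# Hypothetical heatmaps before and after the perturbation.
stable = heatmap_similarity([[1.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 0.0]])
shifted = heatmap_similarity([[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [0.0, 0.0]])
```

A heatmap that stays on the object after the background changes scores high, while one that moves to a different region scores low; under the framework described above, the second kind of sample is the one that would receive closer attention in training.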

Paper Structure (26 sections, 7 equations, 9 figures, 5 tables, 1 algorithm)

This paper contains 26 sections, 7 equations, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: Illustration of how the proposed semantic-preserved adversarial example is generated and how the explanation consistency is evaluated to update the sample weight of an input image in training.
  • Figure 2: The iterative training pipeline of the proposed framework that improves network interpretability by evaluating the explanation consistency. The arrows represent the work-flow. The category label and the training weight (numerical value) for each image are also illustrated.
  • Figure 3: Qualitative comparison of explanation maps on the STL-10 dataset. (a) and (d) denote the input images; (b) and (e) denote the explanation results generated by the Grad-CAM method on the regularly trained VGG-16 and ResNet-50 models, respectively; (c) and (f) denote the explanation results generated by Grad-CAM on the VGG-16 and ResNet-50 models that are trained via our framework. As shown, Grad-CAM gives more accurate semantic localization on the models trained via our framework, which implies that these models have better interpretability than the baseline models.
  • Figure 4: Some examples with high (left) or low (right) explanation consistency from the STL-10 benchmark, obtained by the ResNet-50 model trained with our framework. (a) the input image; (b) the initial explanation result; (c) the semantic-preserved adversarial image; (d) the new explanation result.
  • Figure 5: Qualitative comparison of visualization of filters in top convolutional layers of an interpretable CNN (icCNN). Visualizations are generated following the setup in icnn to ensure a fair comparison.
  • ...and 4 more figures