A Learning Paradigm for Interpretable Gradients

Felipe Torres Figueroa, Hanwei Zhang, Ronan Sicre, Yannis Avrithis, Stephane Ayache

TL;DR

This work targets the interpretability of CNNs by improving the quality of gradient-based explanations. It introduces a training-time regularization that aligns the standard input gradient with the guided backpropagation gradient, yielding a total loss $L = L_C + \lambda L_R$ and enabling sharper, less noisy saliency maps without altering inference. Through experiments on CIFAR-100 with ResNet-18 and MobileNet-V2, the approach improves faithfulness and causal metrics across multiple CAM-based methods while maintaining or slightly improving accuracy. The results demonstrate that regularizing gradients during training enhances interpretability methods’ effectiveness, offering a practical, inference-preserving pathway to more faithful explanations. The method is compatible with several saliency frameworks (e.g., CAM, Grad-CAM, Score-CAM) and provides a scalable alternative to time-consuming inference-time denoising techniques like SmoothGrad.

Abstract

This paper studies the interpretability of convolutional networks by means of saliency maps. Most approaches based on Class Activation Maps (CAM) combine information from fully connected layers and gradients obtained through variants of backpropagation. However, it is well understood that gradients are noisy, and alternatives like guided backpropagation have been proposed to obtain better visualization at inference. In this work, we present a novel training approach to improve the quality of gradients for interpretability. In particular, we introduce a regularization loss such that the gradient with respect to the input image obtained by standard backpropagation is similar to the gradient obtained by guided backpropagation. We find that the resulting gradient is qualitatively less noisy and quantitatively improves the interpretability properties of different networks, using several interpretability methods.
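
The objective described above can be written compactly as follows. This is a sketch rather than the paper's exact formulation: the target class index $t$, the generic error function $E$, and the superscript $G$ marking the guided backward pass are notation assumed here, and the mean absolute error is shown only as one possible choice of $E$.

$$
\begin{aligned}
L_C &= -\log \frac{e^{y_t}}{\sum_j e^{y_j}}, \qquad y = f(x; \theta), \\
L_R &= E\!\left(\frac{\partial L_C}{\partial x},\; \frac{\partial^G L_C}{\partial x}\right), \qquad \text{e.g. } E(a, b) = \frac{1}{n} \sum_{i=1}^{n} |a_i - b_i|, \\
L &= L_C + \lambda L_R.
\end{aligned}
$$

Guided backpropagation differs from standard backpropagation only at each ReLU: where the standard rule passes the incoming signal $\delta$ wherever the forward pre-activation $z$ is positive, the guided rule additionally zeroes negative signals, giving $\delta \cdot \mathbb{1}[z > 0] \cdot \mathbb{1}[\delta > 0]$.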

Paper Structure

This paper contains 36 sections, 13 equations, 3 figures, 4 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Interpretable gradient learning. For an input image $x$, we obtain the logit vector $y = f(x; \theta)$ by a forward pass through the network $f$ with parameters $\theta$. We compute the classification loss $L_C$ by softmax and cross-entropy (\ref{eq:class}), (\ref{eq:ce}). We obtain the standard gradient $\partial L_C / \partial x$ and the guided gradient $\partial^G L_C / \partial x$ by two backward passes (dashed) and compute the regularization loss $L_R$ as the error between the two (\ref{eq:reg}), (\ref{eq:mae})-(\ref{eq:cos}). The total loss is $L = L_C + \lambda L_R$ (\ref{eq:total}). Learning is based on $\partial L / \partial \theta$, which involves differentiation of the entire computational graph except the guided backpropagation branch (blue).
  • Figure 2: Saliency map comparison of standard vs. our training using different CAM-based methods on CIFAR-100 examples.
  • Figure 3: Gradient comparison of standard vs. our training on CIFAR-100 examples.
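
The forward pass and two backward passes described in the Figure 1 caption can be illustrated on a toy network. The sketch below is not the authors' implementation: it assumes a hypothetical two-layer ReLU network with hand-picked weights `W1` and `W2`, uses mean absolute error as the error function, and picks `lam=0.1` arbitrarily. It only evaluates $L_C$, $L_R$ and $L$ for one input; a real training step would also differentiate $L$ with respect to the parameters, which needs an automatic differentiation framework.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def matvec_T(W, v):
    """Multiply the transpose of W by vector v."""
    return [sum(W[i][j] * v[i] for i in range(len(W))) for j in range(len(W[0]))]

def softmax(y):
    m = max(y)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in y]
    s = sum(e)
    return [v / s for v in e]

def input_gradients(x, W1, W2, target):
    """Return (L_C, standard gradient, guided gradient), both w.r.t. x."""
    z = matvec(W1, x)                 # hidden pre-activations
    h = [max(0.0, v) for v in z]      # ReLU
    y = matvec(W2, h)                 # logits
    p = softmax(y)
    loss_c = -math.log(p[target])     # cross-entropy for the target class
    # Gradient of the cross-entropy w.r.t. the logits: softmax minus one-hot.
    dy = [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]
    dh = matvec_T(W2, dy)
    # Standard backprop through ReLU: keep the signal where z > 0.
    dz_std = [g if zi > 0 else 0.0 for g, zi in zip(dh, z)]
    # Guided backprop: additionally drop negative backward signals.
    dz_gd = [g if (zi > 0 and g > 0) else 0.0 for g, zi in zip(dh, z)]
    return loss_c, matvec_T(W1, dz_std), matvec_T(W1, dz_gd)

def mae(a, b):
    """Mean absolute error between two equally sized vectors."""
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)

def total_loss(x, W1, W2, target, lam=0.1):
    """Return (L, L_C, L_R) with L = L_C + lam * L_R."""
    loss_c, g_std, g_gd = input_gradients(x, W1, W2, target)
    loss_r = mae(g_std, g_gd)
    return loss_c + lam * loss_r, loss_c, loss_r

# Hypothetical toy weights and input (3 inputs, 4 hidden units, 2 classes).
W1 = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4], [0.2, 0.1, -0.6], [0.7, -0.5, 0.3]]
W2 = [[0.3, -0.4, 0.2, 0.1], [-0.2, 0.5, -0.1, 0.6]]
x = [1.0, 0.5, -0.5]

if __name__ == "__main__":
    total, loss_c, loss_r = total_loss(x, W1, W2, target=0)
    print(f"L_C={loss_c:.4f}  L_R={loss_r:.4f}  L={total:.4f}")
```

In a framework with automatic differentiation, the standard gradient would be kept in the computational graph so that $L_R$ can be backpropagated to $\theta$, while the guided gradient would be treated as a constant target, matching the caption's note that the guided branch is excluded from differentiation.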