Towards better understanding of gradient-based attribution methods for Deep Neural Networks

Marco Ancona, Enea Ceolini, Cengiz Öztireli, Markus Gross

TL;DR

The paper analyzes gradient-based attribution methods for deep neural networks, revealing theoretical and practical connections among Gradient * Input, epsilon-LRP, Integrated Gradients, and DeepLIFT (Rescale). By reformulating two methods within a unified backpropagation framework, it demonstrates conditions under which these approaches are equivalent or approximate, and introduces Sensitivity-n to quantitatively compare attributions. The study combines theoretical insights with empirical evaluations across image and text tasks, showing when individual methods capture global versus local effects and highlighting limitations in complex models. The proposed framework and metric offer a principled basis for selecting and evaluating attribution methods in practice, with implications for interpretability research and trusted AI.

Abstract

Understanding the flow of information in Deep Neural Networks (DNNs) is a challenging problem that has gained increasing attention over the last few years. While several methods have been proposed to explain network predictions, there have been only a few attempts to compare them from a theoretical perspective. What is more, no exhaustive empirical comparison has been performed in the past. In this work, we analyze four gradient-based attribution methods and formally prove conditions of equivalence and approximation between them. By reformulating two of these methods, we construct a unified framework which enables a direct comparison, as well as an easier implementation. Finally, we propose a novel evaluation metric, called Sensitivity-n, and test the gradient-based attribution methods alongside a simple perturbation-based attribution method on several datasets in the domains of image and text classification, using various network architectures.

Paper Structure

This paper contains 20 sections, 4 theorems, 11 equations, 4 figures, 1 table.

Key Result

Proposition 1

$\epsilon$-LRP is equivalent to the feature-wise product of the input and the modified partial derivative $\partial^{g} S_c(x) / \partial x_i$, with $g = g^{LRP} = f_i(z_i) / z_i$, i.e. the ratio between the output and the input at each nonlinearity.
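The proposition can be checked numerically on a small example. The sketch below is illustrative only: the two-layer tanh network, its weights, the zero biases, and all function names are assumptions made for this example and do not come from the paper. It propagates relevance with the $\epsilon$-LRP rule and, separately, multiplies the input by a gradient in which each nonlinearity's derivative is replaced by $g = f(z)/z$.

```python
import math

# Hypothetical toy network (not from the paper): S(x) = w2 . tanh(W1 x), no biases.
W1 = [[0.5, -1.0, 0.3],
      [1.2, 0.4, -0.7]]
w2 = [0.8, -1.5]
x = [1.0, 2.0, -0.5]
EPS = 1e-9  # epsilon-LRP stabilizer; kept tiny so both computations agree closely


def forward(inputs):
    """Return pre-activations z, activations h = tanh(z), and the output S."""
    z = [sum(w * v for w, v in zip(row, inputs)) for row in W1]
    h = [math.tanh(zj) for zj in z]
    s = sum(w * hj for w, hj in zip(w2, h))
    return z, h, s


def stabilize(v):
    """Add EPS with the sign of v, as in the epsilon-LRP denominator."""
    return v + (EPS if v >= 0 else -EPS)


def epsilon_lrp(inputs):
    """Propagate relevance from the output back to the inputs, layer by layer."""
    z, h, s = forward(inputs)
    # Output -> hidden: each hidden unit receives its share of the output relevance s.
    r_hidden = [w * hj / stabilize(s) * s for w, hj in zip(w2, h)]
    # Hidden -> input: each input receives its share of every hidden unit's relevance.
    return [
        sum(W1[j][i] * inputs[i] / stabilize(z[j]) * r_hidden[j]
            for j in range(len(z)))
        for i in range(len(inputs))
    ]


def modified_gradient_times_input(inputs):
    """Chain rule with tanh'(z) replaced by g = tanh(z) / z, then multiplied by x."""
    z, h, _ = forward(inputs)
    g = [hj / zj for hj, zj in zip(h, z)]
    grad = [
        sum(w2[j] * g[j] * W1[j][i] for j in range(len(z)))
        for i in range(len(inputs))
    ]
    return [v * gi for v, gi in zip(inputs, grad)]


lrp = epsilon_lrp(x)
modified = modified_gradient_times_input(x)
print(lrp)
print(modified)
```

On this toy network the two attribution vectors agree up to an error on the order of `EPS`. Because there are no biases, the attributions also sum approximately to the output $S(x)$.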

Figures (4)

  • Figure 1: Attributions generated by occluding portions of the input image with square grey patches of different sizes. Notice how the size of the patches influences the result, with focus on the main subject only when using bigger patches.
  • Figure 2: Attributions generated by applying several attribution methods to an Inception V3 network for natural image classification (Szegedy et al., 2016). Notice how all gradient-based methods produce attributions affected by higher local variance compared to perturbation-based methods (Figure 1).
  • Figure 3: Comparison of attribution maps (a-b) and plots of the target output variation as some features are removed from the input image. Best seen in electronic form.
  • Figure 4: Test of Sensitivity-$n$ for several values of $n$, over different tasks and architectures.
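The idea behind the Sensitivity-$n$ test in Figure 4 can be illustrated with a minimal sketch. It assumes the following reading of the metric: a method satisfies Sensitivity-$n$ when, for every subset of $n$ features, the attributions of those features sum to the change in the target output caused by removing them (here, setting them to zero). The linear model, its weights, and the function names are hypothetical and chosen only because Gradient * Input is exact in the linear case.

```python
from itertools import combinations

# Hypothetical linear model S(x) = w . x (illustration only, not from the paper).
w = [2.0, -1.0, 0.5, 3.0]
x = [1.0, 4.0, -2.0, 0.5]


def model(inputs):
    return sum(wi * xi for wi, xi in zip(w, inputs))


def gradient_times_input(inputs):
    # For a linear model the gradient with respect to x_i is simply w_i.
    return [wi * xi for wi, xi in zip(w, inputs)]


def sensitivity_n_max_error(inputs, attributions, n):
    """Largest gap, over all subsets of n features, between the summed
    attributions and the output drop when those features are set to zero."""
    base = model(inputs)
    worst = 0.0
    for subset in combinations(range(len(inputs)), n):
        removed = [0.0 if i in subset else v for i, v in enumerate(inputs)]
        delta = base - model(removed)
        attributed = sum(attributions[i] for i in subset)
        worst = max(worst, abs(attributed - delta))
    return worst


attr = gradient_times_input(x)
errors = {n: sensitivity_n_max_error(x, attr, n) for n in range(1, len(x) + 1)}
print(errors)
```

For this linear model the gap is zero for every $n$. For nonlinear networks it is generally nonzero and enumerating all subsets is impractical, so a practical variant samples random subsets and reports how strongly the summed attributions correlate with the output change.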

Theorems & Definitions (6)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • proof
  • proof