What is Missing? Explaining Neurons Activated by Absent Concepts

Robin Hesse, Simone Schaub-Meyer, Janina Hesse, Bernt Schiele, Stefan Roth

TL;DR

This work proposes two simple extensions to attribution and feature visualization techniques that uncover encoded absences, and shows how mainstream XAI methods can be used to reveal and explain encoded absences, how ImageNet models exploit them, and that debiasing can be improved when considering them.

Abstract

Explainable artificial intelligence (XAI) aims to provide human-interpretable insights into the behavior of deep neural networks (DNNs), typically by estimating a simplified causal structure of the model. In existing work, this causal structure often includes relationships where the presence of a concept is associated with a strong activation of a neuron. For example, attribution methods primarily identify input pixels that contribute most to a prediction, and feature visualization methods reveal inputs that cause high activation of a target neuron, the former implicitly assuming that the relevant information resides in the input, and the latter that neurons encode the presence of concepts. However, a largely overlooked type of causal relationship is that of encoded absences, where the absence of a concept increases neural activation. In this work, we show that such missing but relevant concepts are common and that mainstream XAI methods struggle to reveal them when applied in their standard form. To address this, we propose two simple extensions to attribution and feature visualization techniques that uncover encoded absences. Across experiments, we show how mainstream XAI methods can be used to reveal and explain encoded absences, how ImageNet models exploit them, and that debiasing can be improved when considering them.

Paper Structure

This paper contains 28 sections, 1 theorem, 4 equations, 12 figures, and 4 tables.

Key Result

Proposition 2.2

DNNs can implement neurons $z_j$ that encode the absence of a concept $\hat{x}$ in the input context of $x$.
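
One way to read Proposition 2.2, following the mechanism described in the Figure 2 caption, is as a single ReLU unit with a positive weight on an activating concept $\tilde{x}$ and a negative weight on the concept $\hat{x}$ whose absence is encoded. The snippet below is a minimal sketch under stated assumptions, not the authors' construction: the function names, the unit weights, and the use of binary concept activations are all hypothetical choices made for illustration.

```python
import numpy as np


def relu(v):
    """Standard rectified linear unit."""
    return np.maximum(v, 0.0)


def absence_neuron(a_tilde, a_hat, w_tilde=1.0, w_hat=1.0):
    """Hypothetical neuron z_j = ReLU(w_tilde * a_tilde - w_hat * a_hat).

    a_tilde: activation of an upstream neuron detecting concept x_tilde
    a_hat:   activation of an upstream neuron detecting concept x_hat
    The negative connection to a_hat means z_j is high only when
    x_tilde is present AND x_hat is absent.
    """
    return relu(w_tilde * a_tilde - w_hat * a_hat)


if __name__ == "__main__":
    # Enumerate the four binary presence/absence combinations.
    for a_tilde in (0.0, 1.0):
        for a_hat in (0.0, 1.0):
            z = absence_neuron(a_tilde, a_hat)
            print(f"x_tilde={a_tilde:.0f}, x_hat={a_hat:.0f} -> z_j={z:.0f}")
```

With these assumed weights, the unit fires only for the combination "$\tilde{x}$ present, $\hat{x}$ absent", which matches the $\tilde{x} \land \neg \hat{x}$ reading of the mechanism.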

Figures (12)

  • Figure 1: Encoded absence in image classification. (a) The model detects concepts present in the input image that are prototypical for the target class (e.g., the snout and feet). (b) The model can additionally encode the absence of snouts from other dog species to enhance evidence for the "Irish setter" class.
  • Figure 1: Quantitative evaluation of encoded absences. We measure the activation of the 100 highest-activating images when inserting none, random, encoded logical $\operatorname{NOT}$s (Mu and Andreas, 2020), or least activating $48\times 48$ patches in a random corner.
  • Figure 2: A mechanistic process to encode the absence of a concept. A neuron encoding the absence of concept $\hat{x}$ (i.e., $\neg \hat{x}$) can be implemented by having a negative connection to a neuron encoding $\hat{x}$ and a positive potential through, e.g., another activating concept $\tilde{x}$ (i.e., the output encodes $\tilde{x} \land \neg \hat{x}$).
  • Figure 3: Hassenstein-Reichardt detector experiment. (a) Two example sequences showing a left-to-right and a bi-directional movement. (b) A hand-crafted CNN to distinguish left-to-right motion from bi-directional motion. The first layer implements the spatio-temporal comparison of neighboring pixels, and the second layer compares motion in opposing directions, followed by global average pooling (GAP). The first output node implements a Hassenstein-Reichardt detector (weights: 1/-1) and the second output averages both directions (weights: 0.5/0.5). (c) The outputs of the model for the two example sequences. (d) Visualizations of established XAI methods: target attribution and feature visualization for the highest activating patches, each consisting of two consecutive frames as CNN input (numbers indicate the activation strength). Both methods fail to highlight the absence encoded in the first output and thus lack a complete explanation of the CNN mechanisms. (e) Our proposed non-target attributions and feature visualization through minimization highlight that the first output encodes the absence of right-to-left motion.
  • Figure 4: Toy experiment. (a) Example RGB images from class 1 (green pixel) and class 2 (no green pixel); zoom in for better visibility. (b) Architecture of the used toy model with the trained weights. (c) Average logit output for the two output nodes for images from class 1 and 2. Confidence intervals represent two times the standard deviation. (d) Integrated Gradients (Sundararajan et al., 2017) target attributions for the above example images, and maximally activating patches for the two output nodes. (e) Non-target attributions for the two respective examples, where the attributions switch from positive (blue) to negative (red), and minimally activating patches for the two output nodes (numbers indicate the activation strength).
  • ...and 7 more figures
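
The hand-crafted detector described in the Figure 3 caption can be approximated in a few lines of NumPy. This is a simplified sketch under assumptions that are not taken from the paper: frames are one-dimensional, the spatio-temporal comparison is a product of neighboring pixels across two frames, and the function names are hypothetical. Only the output weights (1/-1 and 0.5/0.5) and the global average pooling step come from the caption.

```python
import numpy as np


def directional_motion(frame_t0, frame_t1):
    """Pooled evidence for rightward and leftward motion between two 1D frames.

    Assumed comparison: a pixel active at position i in the first frame and
    at position i+1 in the second frame counts as rightward motion, and the
    mirrored case counts as leftward motion.
    """
    f0 = np.asarray(frame_t0, dtype=float)
    f1 = np.asarray(frame_t1, dtype=float)
    rightward = f0[:-1] * f1[1:]
    leftward = f0[1:] * f1[:-1]
    # Global average pooling over spatial positions.
    return rightward.mean(), leftward.mean()


def detector_outputs(frame_t0, frame_t1):
    """Return (opponent, average) outputs with the weights from the caption."""
    r, l = directional_motion(frame_t0, frame_t1)
    opponent = 1.0 * r - 1.0 * l  # Hassenstein-Reichardt style output
    average = 0.5 * r + 0.5 * l   # averages both directions
    return opponent, average


if __name__ == "__main__":
    # One dot moving left-to-right.
    print(detector_outputs([0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]))
    # Two dots moving in opposite directions (bi-directional motion).
    print(detector_outputs([0, 1, 0, 0, 1, 0], [0, 0, 1, 1, 0, 0]))
```

In this sketch the opponent output is positive for pure left-to-right motion and drops to zero once right-to-left motion is added, so its activation depends on the absence of leftward motion, while the averaging output does not.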

Theorems & Definitions (3)

  • Definition 2.1: Encoded Absence
  • Proposition 2.2
  • Definition 1.1: Encoded Absence for a Feature-Space Direction