Understanding Inhibition Through Maximally Tense Images

Chris Hamblin, Srijani Saha, Talia Konkle, George Alvarez

TL;DR

This work addresses the functional role of 'feature inhibition' in vision models, and proposes inhibition be understood through a study of 'maximally tense images' (MTIs), those images that excite and inhibit a given feature simultaneously.

Abstract

We address the functional role of 'feature inhibition' in vision models; that is, what are the mechanisms by which a neural network ensures images do not express a given feature? We observe that standard interpretability tools in the literature are not immediately suited to the inhibitory case, given the asymmetry introduced by the ReLU activation function. Given this, we propose inhibition be understood through a study of 'maximally tense images' (MTIs), i.e. those images that excite and inhibit a given feature simultaneously. We show how MTIs can be studied with two novel visualization techniques: +/- attribution inversions, which split single images into excitatory and inhibitory components, and the attribution atlas, which provides a global visualization of the various ways images can excite/inhibit a feature. Finally, we explore the difficulties introduced by superposition, as such interfering features induce the same attribution motif as MTIs.
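
To make the ReLU asymmetry concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a single linear feature whose pre-ReLU logit is a weighted sum of incoming activations, and it uses the per-input products `activation * weight` as attributions. The names `split_attribution` and `tension`, the toy weights, and the `min`-based tension score are all hypothetical choices made for illustration; the paper's own definitions may differ.

```python
def split_attribution(activations, weights):
    """Split a linear feature's pre-ReLU logit into excitatory and inhibitory sums.

    Assumption: attribution of each input is activation * weight, so the two
    sums add up exactly to the logit for this toy linear feature.
    """
    contributions = [a * w for a, w in zip(activations, weights)]
    e_pos = sum(c for c in contributions if c > 0)  # total excitation
    e_neg = sum(c for c in contributions if c < 0)  # total inhibition (<= 0)
    return e_pos, e_neg


def relu(x):
    return max(0.0, x)


def tension(e_pos, e_neg):
    # Hypothetical score: large only when excitation AND inhibition are both large.
    return min(e_pos, -e_neg)


# Toy feature: the first two inputs excite it, the last two inhibit it.
weights = [1.0, 1.0, -1.0, -1.0]

# Hypothetical "images", given directly as incoming activations.
images = {
    "exciting": [2.0, 1.0, 0.0, 0.0],
    "inhibiting": [0.0, 0.0, 2.0, 1.0],
    "tense": [2.0, 1.0, 2.0, 1.0],
    "blank": [0.0, 0.0, 0.0, 0.0],
}

summary = {}
for name, acts in images.items():
    e_pos, e_neg = split_attribution(acts, weights)
    summary[name] = (e_pos, e_neg, relu(e_pos + e_neg), tension(e_pos, e_neg))

for name, (e_pos, e_neg, act, t) in summary.items():
    print(f"{name:>10}: E+={e_pos} E-={e_neg} post-ReLU={act} tension={t}")
```

In this toy setting the "inhibiting", "tense", and "blank" inputs all yield a post-ReLU activation of zero, so the activation alone cannot separate active inhibition from mere absence of evidence. Only the split into positive and negative attribution distinguishes them, which is the motivation for looking at images where both sums are large.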

Paper Structure (26 sections, 7 equations, 20 figures)

Figures (20)

  • Figure 1: How might a network construct an accurate 'banana' feature, that doesn't activate for duckbills?
  • Figure 2: Example MEI and MII images from the ImageNet validation dataset for random features across several layers of InceptionV1. For each layer, the top row images correspond to MEIs/MIIs for a unit in that layer. The bottom row images correspond to a feature direction identified with k-means clustering. For both unit and k-means features, MEIs and their respective MIIs seem relatable in early layers, but arbitrarily paired in later layers.
  • Figure 3: The correlation between logits, $\bm{f}_{\bm{v}}$, and total attribution $E_{l}$ measured across layers. $\bm{f}_{\bm{v}} \approx E_{l}$ across all layers, except when measured through very early layers and pixels.
  • Figure 4: A. A scatterplot of $E^{+}_{l}$ and $E^{-}_{l}$ for a proposed 'curve detector' unit, across validation set images. Selected images visualized in B.-F. are circled in the corresponding color. B. shows MEI examples, C. MIIs, and D. images with no attribution. E. shows images with positive and negative attributions at different spatial locations, while F. shows images with positive and negative attribution in the channel dimension, at the same spatial location. G. The colorscale used for the $(\bm{\varphi}^{+}_{l},\bm{\varphi}^{-}_{l})$ CAM maps, which spatialize the positive and negative attribution in a given image.
  • Figure 5: MTIs for 3 units, with their $\pm$ attribution accentuations, inversions, and standard feature visualizations.
  • ...and 15 more figures