Don't trust your eyes: on the (un)reliability of feature visualizations

Robert Geirhos; Roland S. Zimmermann; Blair Bilodeau; Wieland Brendel; Been Kim

TL;DR

The paper questions the reliability of feature visualizations (activation maximization) as explanations of neural behavior, demonstrating that visualizations can be arbitrarily manipulated with fooling circuits or silent units without altering natural-input performance. An empirical sanity check shows that visualizations traverse different processing paths than natural inputs for most network layers, casting doubt on their explanatory value. The authors provide a theoretical no-go framework showing that, without strong structural assumptions about the function, no decoder can reliably predict a network’s output from its maximal/minimal visualizations; reliability only emerges under very restrictive conditions. The work suggests designing networks with interpretability-enabling structures or exploring alternative visualization paradigms to achieve trustworthy mechanistic insights.

Abstract

How do neural networks extract patterns from pixels? Feature visualizations attempt to answer this important question by visualizing highly activating patterns through optimization. Today, visualization methods form the foundation of our knowledge about the internal workings of neural networks, as a type of mechanistic interpretability. Here we ask: How reliable are feature visualizations? We start our investigation by developing network circuits that trick feature visualizations into showing arbitrary patterns that are completely disconnected from normal network behavior on natural input. We then provide evidence for a similar phenomenon occurring in standard, unmanipulated networks: feature visualizations are processed very differently from standard input, casting doubt on their ability to "explain" how neural networks process natural images. This can be used as a sanity check for feature visualizations. We underpin our empirical findings by theory proving that the set of functions that can be reliably understood by feature visualization is extremely small and does not include general black-box neural networks. Therefore, a promising way forward could be the development of networks that enforce certain structures in order to ensure more reliable feature visualizations.
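The sanity check mentioned in the abstract compares, layer by layer, how similar the activations caused by natural images are to those caused by same-class feature visualizations, against a baseline of images from other classes. The following is a minimal sketch of that comparison for a single layer. The similarity measure (Spearman rank correlation, without tie handling), the function names, and the toy activation vectors are assumptions made for illustration; this summary does not specify the paper's exact procedure.

```python
from typing import Sequence, Tuple


def ranks(values: Sequence[float]) -> list:
    """Return the rank (0 = smallest) of each entry; ties are not handled."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = float(rank)
    return result


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def sanity_check(natural: Sequence[float],
                 visualization: Sequence[float],
                 other_class: Sequence[float]) -> Tuple[float, float]:
    """Return (similarity to same-class visualization, baseline similarity).

    Each argument is a hypothetical vector of unit activations in one layer.
    If the first value is not clearly above the second, the visualization
    is processed no more like the natural image than an unrelated image is.
    """
    return spearman(natural, visualization), spearman(natural, other_class)


# Toy activations, invented for illustration only.
natural_acts = [0.1, 0.5, 0.9, 0.3]
vis_acts = [0.2, 0.6, 1.0, 0.4]      # same ordering of units as natural_acts
other_acts = [0.9, 0.5, 0.1, 0.7]    # reversed ordering of units
same_sim, baseline_sim = sanity_check(natural_acts, vis_acts, other_acts)
```

In practice the activation vectors would come from recording one layer of a real network on each kind of input, and the comparison would be repeated for every layer.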

Paper Structure (57 sections, 13 theorems, 37 equations, 21 figures, 3 tables)

Introduction
Adversarial perspective: Can feature visualizations be fooled?
Manipulating feature visualizations through a fooling circuit
Manipulating feature visualizations by leveraging silent units
Empirical perspective: How can we sanity-check feature visualizations?
Theoretical perspective: Under which circumstances is feature visualization guaranteed to be reliable?
Notation and definitions.
Main theoretical results
Conclusion
Appendix
Literature
Related work on deceiving interpretability methods
Literature expectations about feature visualization
Relationship to highly activating natural samples
Proofs and theory details
...and 42 more sections

Key Result

Proposition 1

There exists $D\in\mathcal{D}$ such that for all $f\in\mathcal{F}$,

Figures (21)

Figure 1: Arbitrary feature visualizations. Don't trust your eyes: Feature visualizations can be arbitrarily manipulated by embedding a fooling circuit in a network, which changes visualizations while maintaining the original network's ImageNet accuracy. Left: Original feature visualizations. Right: In a network with a fooling circuit as described in the section "Manipulating feature visualizations through a fooling circuit", feature visualizations can be tricked into visualizing arbitrary patterns (e.g., Mona Lisa).
Figure 2: Using a fooling circuit to arbitrarily permute visualizations. Top row: Visualizations of the last-layer units in the original Inception-V1 model. Bottom row: After integrating a fooling circuit as described in the section "Manipulating feature visualizations through a fooling circuit", units show an arbitrarily permuted visualization (here: offset by 100 indices).
Figure 3: Fooling circuit. This circuit consists of six units. Unit $A$ responds like unit $F$ for natural images, but the feature visualizations of $A$ are identical to those of $D$. This is achieved by a classifier unit ($E$) distinguishing between natural and visualization input, and two intermediate units with ReLU nonlinearities ($B$ and $C$) selectively suppressing information depending on the classifier's output. $k$ is an arbitrarily large constant that ensures the gradient flows only through either the left or the right part of the circuit, not both, by pushing $B$ or $C$'s pre-ReLU activation below zero.
Figure 4: Leveraging silent units to produce identical visualizations throughout a layer. The top row shows feature visualizations for units of a layer (block 4-1, conv 2) in a standard, unmanipulated ResNet-50. For the bottom row, we manipulate the model such that the feature visualizations of all units become near-identical (indicated by the red box). Nevertheless, the units still perform the same computations as in the original model on natural input, as evidenced by an unchanged validation loss. This is achieved by leveraging orthogonal filters in silent units as described in the section "Manipulating feature visualizations by leveraging silent units".
Figure 5: Sanity check: Feature visualizations are processed differently than natural images. Feature visualizations are designed to explain how neural networks process natural input, but do feature visualizations for a certain class actually activate similar units as natural input from this class? We measure the similarity of a layer's activations caused by natural images and feature visualizations across layers. Throughout the first two thirds of Inception-V1 layers, activations of natural images have roughly as little similarity to same-class visualizations as they have to completely arbitrary images of different classes. In the last third of the network, similarity increases. Layer annotations (e.g., "textures") are from Olah et al. (2017).
...and 16 more figures
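The gating behavior described in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's exact wiring: the classifier output `e` is assumed to be 1 for visualization-like input and 0 for natural input, the activations `d` and `f` are assumed to be non-negative and smaller than `k`, and the function and variable names are hypothetical.

```python
def relu(x: float) -> float:
    """Standard rectified linear unit."""
    return max(0.0, x)


def fooling_circuit(d: float, f: float, e: int, k: float = 1e3) -> float:
    """Return the activation of unit A given units D, F and classifier E.

    Assumed wiring (illustrative only):
      B = ReLU(D - k * (1 - E))   # passes D only when E == 1
      C = ReLU(F - k * E)         # passes F only when E == 0
      A = B + C
    The large constant k pushes the pre-ReLU activation of the suppressed
    branch below zero, so that branch contributes neither signal nor gradient.
    """
    b = relu(d - k * (1 - e))
    c = relu(f - k * e)
    return b + c


# On natural input (e = 0), A behaves like F.
a_natural = fooling_circuit(d=2.0, f=5.0, e=0)
# On visualization-like input (e = 1), A behaves like D.
a_visualization = fooling_circuit(d=2.0, f=5.0, e=1)
```

Because only one branch is ever active, an optimizer that maximizes `A` starting from visualization-like input only ever follows the `D` branch, while natural inputs only ever exercise the `F` branch.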

Theorems & Definitions (20)

Proposition 1
Theorem 1
Theorem 2
Lemma 1
Proof of Lemma 1
Lemma 2
Proof of Lemma 2
Remark 1
Theorem 3
Theorem 4
...and 10 more
