The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision

Liv Gorton

TL;DR

This paper tackles polysemantic neurons in InceptionV1 by applying sparse autoencoders (SAEs) to the model's early vision layers, revealing interpretable features and previously missing curve detectors. The author trains SAEs on activations sampled from ImageNet (ILSVRC), decomposing each activation into a sparse combination of feature directions plus a bias, and analyses the results with dataset examples and feature visualisation. The SAEs uncover new curve detectors that fill gaps left by the known curve-detecting neurons, and some polysemantic neurons decompose into more monosemantic components, including a case where a double-curve detector splits into multiple features. Overall, SAEs emerge as a valuable tool for mechanistic interpretability in convolutional networks such as InceptionV1, with potential applicability to other architectures.

Abstract

Recent work on sparse autoencoders (SAEs) has shown promise in extracting interpretable features from neural networks and addressing challenges with polysemantic neurons caused by superposition. In this paper, we apply SAEs to the early vision layers of InceptionV1, a well-studied convolutional neural network, with a focus on curve detectors. Our results demonstrate that SAEs can uncover new interpretable features not apparent from examining individual neurons, including additional curve detectors that fill in previous gaps. We also find that SAEs can decompose some polysemantic neurons into more monosemantic constituent features. These findings suggest SAEs are a valuable tool for understanding InceptionV1, and convolutional neural networks more generally.
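
As a rough illustration of the decomposition described above (an activation approximated by a bias plus a sparse, non-negative combination of feature directions), here is a minimal NumPy sketch. It is not the paper's implementation: the ReLU encoder, the subtraction of the decoder bias before encoding, the unit-norm feature directions, the exact form of the L1 penalty, and all names and sizes (`sae_forward`, `sae_loss`, `W_enc`, `W_dec`, `d`, `m`) are assumptions made for the example.

```python
import numpy as np


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations x (n, d) into features (n, m) and reconstruct them.

    Assumed architecture: ReLU encoder, linear decoder with a bias.
    """
    # Non-negative feature activations; most should be zero after training.
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # Reconstruction: bias plus a weighted sum of feature directions (rows of W_dec).
    x_hat = f @ W_dec + b_dec
    return f, x_hat


def sae_loss(x, f, x_hat, l1_coeff):
    """Squared reconstruction error plus an L1 sparsity penalty on the features."""
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(f), axis=1))
    return mse + l1_coeff * l1


# Toy sizes and random inputs standing in for sampled InceptionV1 activations.
rng = np.random.default_rng(0)
d, m = 8, 32  # d: channels in the layer, m: dictionary size (both hypothetical)
x = rng.normal(size=(4, d))
W_enc = rng.normal(scale=0.1, size=(d, m))
W_dec = rng.normal(size=(m, d))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm feature directions
b_enc = np.zeros(m)
b_dec = np.zeros(d)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat, l1_coeff=1e-3)
```

In this sketch, a larger `l1_coeff` penalises feature activations more heavily, which is the trade-off varied in the Figure 3 comparison across SAEs with different L1 coefficients.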

Paper Structure (14 sections, 5 figures, 1 table)

Figures (5)

  • Figure 1: Examples of one interpretable feature learned by the SAE at each layer. For each feature, we present dataset examples at varying activation levels and a feature visualisation. We also show the neuron most similar to this feature, along with its dataset examples and feature visualisation.
  • Figure 2: Left & Middle: Synthetic data plots showing how curve detector features (left) and neurons (middle) respond to synthetic curve stimuli, as in Cammarata et al. (2020). Activations are measured in standard deviations. A subset of the new curve detectors is shown, with each gap between the neurons corresponding to a previously missing curve detector. Right: The same data shown on radial tuning curve plots, again following Cammarata et al. (2020). The radius of the curve at a given orientation denotes activation, again measured in standard deviations.
  • Figure 3: Top: The three SAE features most strongly weighted to mixed3b/n/359, which Olah et al. (2020) previously identified as a double curve detector that was likely polysemantic. Bottom: Max activations of analogous features across SAEs with different L1 coefficients. As the L1 coefficient increases, the max activation of the double curve feature shrinks, while those of the left and right curve features correspondingly grow.
  • Figure 4: The extent to which each learned feature of a sparse autoencoder trained across all branches of mixed3b is represented by neurons on the $5 \times 5$ branch. Negative weights were excluded when computing the norm. Curve detectors are amongst the features with the greatest branch specialisation.
  • Figure 5: The extent to which each learned feature of sparse autoencoders trained across all branches of mixed3b, mixed4e, and mixed5b is represented by neurons on the $5 \times 5$ branch. Negative weights were excluded when computing the norm.
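
The branch-specialisation measure in the Figure 4 and Figure 5 captions could be computed along the following lines. This is only a sketch of one plausible reading: the captions say that negative weights are excluded and that a norm is taken, but the use of decoder weights, the L2 norm, the ratio against the norm over all channels, and the names `branch_fraction`, `W_dec`, and `branch` are assumptions for illustration.

```python
import numpy as np


def branch_fraction(W_dec, branch):
    """Fraction of each feature's positive weight norm that lies on one branch.

    W_dec:  (n_features, n_channels) weights from features to neurons (assumed).
    branch: slice or index array selecting that branch's channels (assumed).
    """
    pos = np.clip(W_dec, 0.0, None)  # exclude negative weights
    total = np.linalg.norm(pos, axis=1)
    on_branch = np.linalg.norm(pos[:, branch], axis=1)
    # Features with no positive weight at all get a fraction of 0.
    return np.divide(on_branch, total, out=np.zeros_like(total), where=total > 0)


# Toy example: 3 features over 4 channels, with the last 2 channels standing in
# for the 5x5 branch (hypothetical layout).
W_dec = np.array([
    [0.0, 0.0, 3.0, 4.0],   # all positive weight on the branch
    [1.0, 0.0, 0.0, -5.0],  # only negative weight on the branch
    [3.0, 0.0, 4.0, 0.0],   # positive weight split across branches
])
frac = branch_fraction(W_dec, slice(2, 4))
print(frac)
```

Under this reading, a value near 1 means a feature is written almost entirely through neurons on the chosen branch, which is the sense in which the curve detectors are described as highly branch-specialised.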