The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision
Liv Gorton
TL;DR
This paper tackles polysemantic neurons in InceptionV1 by applying sparse autoencoders (SAEs) to the model's early vision layers, revealing interpretable features that include previously missing curve detectors. The authors train SAEs on InceptionV1 activations sampled over ImageNet (ILSVRC) images, decomposing each activation into a sparse combination of feature directions plus a bias, and they analyze the results with dataset examples and feature visualisation. They show that SAEs uncover new curve detectors that fill gaps left by the neuron-level curve detectors, and that some polysemantic neurons can be decomposed into monosemantic components, including a case where a double-curve detector splits into multiple features. Overall, SAEs emerge as a valuable tool for mechanistic interpretability in convolutional networks like InceptionV1, with potential applicability to other architectures.
Abstract
Recent work on sparse autoencoders (SAEs) has shown promise in extracting interpretable features from neural networks and addressing challenges with polysemantic neurons caused by superposition. In this paper, we apply SAEs to the early vision layers of InceptionV1, a well-studied convolutional neural network, with a focus on curve detectors. Our results demonstrate that SAEs can uncover new interpretable features not apparent from examining individual neurons, including additional curve detectors that fill in previous gaps. We also find that SAEs can decompose some polysemantic neurons into more monosemantic constituent features. These findings suggest SAEs are a valuable tool for understanding InceptionV1, and convolutional neural networks more generally.
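To make the decomposition mentioned in the TL;DR concrete, the following is a minimal NumPy sketch of one common SAE formulation: a ReLU encoder produces non-negative feature activations, and a linear decoder reconstructs the input as a weighted sum of feature directions plus a bias. The encoder form, the L1 sparsity penalty, the array sizes, and every name (`sae_forward`, `sae_loss`, `l1_coeff`) are assumptions for illustration and are not taken from the paper. Each row of `x` stands in for one activation vector, for example the channels at a single spatial position of a convolutional layer.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations x (n, d) into features (n, m) and reconstruct them.

    Assumed formulation (illustrative, not the paper's exact setup):
        f     = ReLU((x - b_dec) @ W_enc + b_enc)
        x_hat = f @ W_dec + b_dec
    Each row of W_dec is one feature direction in activation space.
    """
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff):
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1_coeff * sparsity

# Hypothetical sizes: 8 activation channels, 32 dictionary features, 4 samples.
rng = np.random.default_rng(0)
d, m, n = 8, 32, 4
x = rng.normal(size=(n, d))  # stand-in for sampled layer activations

W_enc = rng.normal(scale=0.1, size=(d, m))
b_enc = np.zeros(m)
W_dec = rng.normal(size=(m, d))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm directions
b_dec = np.zeros(d)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat, l1_coeff=1e-3)
```

In a trained SAE, most entries of `f` would be zero for any given input, so each activation is explained by a small number of feature directions. The weights here are random and untrained, so the sketch shows only the shapes and the form of the objective.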
