Automatic Discovery of Visual Circuits
Achyuta Rajaram, Neil Chowdhury, Antonio Torralba, Jacob Andreas, Sarah Schwettmann
TL;DR
This work introduces Cross-Layer Attribution (CLA), a scalable method for automatically discovering visual circuits in deep networks. CLA constructs a subgraph with $k$ neurons per layer that maximizes cross-layer attribution, defined as $\mathrm{Score} = |a_m| \cdot |\partial a_n / \partial a_m|$, where $a_m$ and $a_n$ are the activations of a neuron in an earlier layer and a neuron in a later layer, respectively. CLA builds and refines circuits layer by layer, and supports causal intervention through edge pruning and circuit pruning to test a circuit's effect on outputs. Applying CLA to Inception-CatFish reveals intermediate concept circuits that causally influence predictions and shows that CatFish outputs are composed from reusable concept circuits. The approach also extends to defending multimodal models such as CLIP: pruning a text-detection circuit significantly improves robustness to text-based adversarial attacks, demonstrating practical utility for model editing and defense.
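To make the score concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy setting in which the later layer is a single fully connected ReLU layer, $a_n = \mathrm{ReLU}(W a_m + b)$, so the gradient $\partial a_n / \partial a_m$ can be written by hand; in a real vision model it would come from automatic differentiation and would likely be aggregated over images and spatial positions. The function names `cross_layer_scores` and `top_k_upstream`, the weights, and the simple per-neuron top-$k$ selection are all hypothetical choices made for illustration.

```python
def cross_layer_scores(a_m, W, b):
    """Return scores[n][m] = |a_m[m]| * |d a_n[n] / d a_m[m]|.

    Assumes (for illustration only) a_n = ReLU(W @ a_m + b), so the
    gradient of a_n[n] with respect to a_m[m] is W[n][m] when the
    pre-activation of neuron n is positive, and 0 otherwise.
    """
    scores = []
    for row, bias in zip(W, b):
        pre_activation = sum(w * a for w, a in zip(row, a_m)) + bias
        gate = 1.0 if pre_activation > 0 else 0.0  # derivative of ReLU
        scores.append([abs(a) * abs(gate * w) for w, a in zip(row, a_m)])
    return scores


def top_k_upstream(scores, n, k):
    """Indices of the k earlier-layer neurons with the highest score for neuron n."""
    ranked = sorted(range(len(scores[n])), key=lambda m: scores[n][m], reverse=True)
    return ranked[:k]


# Hypothetical toy values: three earlier-layer neurons, two later-layer neurons.
a_m = [1.0, 0.5, 2.0]
W = [[0.5, -2.0, 0.1],
     [-1.0, 0.0, 0.2]]
b = [1.0, 0.0]

scores = cross_layer_scores(a_m, W, b)
print(scores[0])                      # attribution of each earlier neuron to later neuron 0
print(scores[1])                      # later neuron 1 is inactive, so every score is zero
print(top_k_upstream(scores, 0, 2))   # [1, 0]
```

In this toy example the earlier-layer neuron with the largest weight magnitude and a moderate activation ranks first. The product form is what drives this: a strongly active neuron with a negligible gradient, or a large gradient on an inactive neuron, both score low.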
Abstract
To date, most discoveries of network subcomponents that implement human-interpretable computations in deep vision models have involved close study of single units and large amounts of human labor. We explore scalable methods for extracting the subgraph of a vision model's computational graph that underlies recognition of a specific visual concept. We introduce a new method for identifying these subgraphs: specifying a visual concept using a few examples, and then tracing the interdependence of neuron activations across layers, or their functional connectivity. We find that our approach extracts circuits that causally affect model output, and that editing these circuits can defend large pretrained models from adversarial attacks.
