Automatic Discovery of Visual Circuits

Achyuta Rajaram; Neil Chowdhury; Antonio Torralba; Jacob Andreas; Sarah Schwettmann

TL;DR

This work introduces Cross-Layer Attribution (CLA), a scalable method for automatically discovering visual circuits in deep networks. CLA constructs a subgraph with $k$ neurons per layer that maximizes cross-layer attribution, defined as $\mathrm{Score} = |a_m| \cdot |\partial a_n / \partial a_m|$, where $a_m$ and $a_n$ are the activations of neurons $m$ and $n$ in different layers. CLA builds and refines circuits layer by layer, and supports causal intervention via edge pruning and circuit pruning to test a circuit's effect on outputs. Applying CLA to Inception-CatFish reveals intermediate concept circuits that causally influence predictions and shows that CatFish outputs are composed from reusable concept circuits. The approach also extends to defending multimodal models such as CLIP: pruning a text-detection circuit significantly improves robustness to text-based adversarial attacks, demonstrating practical utility for model editing and defense.
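The score above can be illustrated on a toy example. The following is a minimal sketch, not the authors' implementation: it assumes a single linear+ReLU layer so that the derivative $\partial a_n / \partial a_m$ has a closed form, and it assumes that upstream neurons are ranked by their summed score into a set of target neurons. The weights, activations, and the helper names `relu_layer`, `cla_scores`, and `top_k_upstream` are all hypothetical.

```python
# Toy sketch of the CLA score on one linear+ReLU layer.
# All values and helper names are illustrative assumptions, not the paper's code.

def relu_layer(weights, upstream):
    """Downstream activations a_n = relu(sum_m W[n][m] * a_m)."""
    return [max(0.0, sum(w * a for w, a in zip(row, upstream))) for row in weights]

def cla_scores(weights, upstream):
    """Score[n][m] = |a_m| * |d a_n / d a_m| for a linear+ReLU layer."""
    downstream = relu_layer(weights, upstream)
    scores = []
    for row, a_n in zip(weights, downstream):
        # For ReLU, d a_n / d a_m equals W[n][m] when unit n is active, else 0.
        gate = 1.0 if a_n > 0.0 else 0.0
        scores.append([abs(a_m) * abs(gate * w) for w, a_m in zip(row, upstream)])
    return scores

def top_k_upstream(scores, targets, k):
    """Assumed selection rule: keep the k upstream neurons with the largest
    total score into the target downstream neurons."""
    n_upstream = len(scores[0])
    totals = [sum(scores[n][m] for n in targets) for m in range(n_upstream)]
    ranked = sorted(range(n_upstream), key=lambda m: totals[m], reverse=True)
    return sorted(ranked[:k])

# Two downstream neurons, three upstream neurons.
W = [[2.0, 0.0, -1.0],
     [0.5, 3.0, 0.0]]
a_m = [1.0, 2.0, 0.5]

scores = cla_scores(W, a_m)
selected = top_k_upstream(scores, targets=[0, 1], k=2)
print(scores)    # per-edge attribution scores
print(selected)  # indices of the k upstream neurons kept in the circuit
```

In a real network the derivative would come from automatic differentiation rather than a closed form, and the selection would be repeated layer by layer.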

Abstract

To date, most discoveries of network subcomponents that implement human-interpretable computations in deep vision models have involved close study of single units and large amounts of human labor. We explore scalable methods for extracting the subgraph of a vision model's computational graph that underlies recognition of a specific visual concept. We introduce a new method for identifying these subgraphs: specifying a visual concept using a few examples, and then tracing the interdependence of neuron activations across layers, or their functional connectivity. We find that our approach extracts circuits that causally affect model output, and that editing these circuits can defend large pretrained models from adversarial attacks.

Paper Structure (21 sections, 11 figures, 3 algorithms)

Introduction
Circuit discovery in vision models.
Automated circuit extraction.
Methods
Circuit extraction based on Cross-Layer Attribution (CLA)
Intervention analysis of vision models
Compositional circuits in Inception-CatFish
Constructing a dataset with visual feature hierarchy (CatFish)
CLA identifies intermediate concept circuits that causally affect model output
Pruning an intermediate concept circuit removes that concept from the output distribution
Circuits corresponding to CatFish output classes contain intermediate concept neurons
Circuit pruning defends CLIP from text-based adversarial attacks
Traffic light dataset for benchmarking textual defense
Model intervention protects CLIP from adversarial attacks
Discussion
...and 6 more sections

Figures (11)

Figure 1: Automatic discovery of the car circuit inside Inception using CLA. The $\star$ indicates that a unit is present in the weight-based circuit discovered by Olah et al. (olah2020zoom). Maximally activating dataset exemplars are shown for each neuron. CLA recovers all units in the car circuit from olah2020zoom (units 491, 237, 373 in Layer 4b; unit 447 in Layer 4c), as well as additional car-detecting neurons in all three layers studied. Edge thickness is proportional to the Cross-Layer Attribution score.
Figure 2: CatFish dataset examples. Each CatFish class composes images from two ImageNet classes. CatFish contains 20 composed classes sampled from 10 ImageNet classes.
Figure 3: Intervening on feature composition in Inception-CatFish by edge pruning. (a) CLA-generated circuits for the tabby-joystick (red) and tabby-pajama (blue) output classes. Neurons in both circuits (purple) correspond to the shared tabby intermediate concept (see Figure \ref{fig:sharedneurons} for visualization). (b) IoU between the neurons selected using CLA and the maximally activated neurons for a given concept. (c) Predicted impact of intermediate concept knockout. Eliminating a concept through edge pruning should affect only class outputs containing that concept, with no effect on recognition of other concepts. (d) Model accuracy on positive CatFish classes. (e) Model accuracy on negative classes. (f) Inception-CatFish prediction logits before and after pruning each concept circuit, on images sampled from a class containing the concept.
Figure 4: Samples from the Traffic Light dataset. The dataset includes real traffic light images, ImageNet with overlaid traffic light text, and adversarial text-attacked images.
Figure 5: Intervening on CLIP to prevent text-based adversarial attacks. (a) Schematic of the intervention showing that pruning the text circuit defends CLIP from a real-world adversarial attack. (b) Adversarial accuracy as a function of circuit width when intervening on three CLIP layers. (c) CLA text circuit for layer 3. Residual connections between equivalent channels are shown in black.
...and 6 more figures
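
Figure 3(b) compares two neuron sets using IoU (intersection over union). As a minimal sketch of that metric, assuming each set is given as a list of unit indices within one layer, the computation could look like the following. The function name `neuron_iou` and the example indices are hypothetical and are not taken from the paper.

```python
def neuron_iou(selected, reference):
    """Intersection over union of two collections of neuron indices."""
    a, b = set(selected), set(reference)
    union = a | b
    # Define the IoU of two empty sets as 0.0 to avoid dividing by zero.
    return len(a & b) / len(union) if union else 0.0

# Hypothetical unit indices for one layer.
cla_units = [3, 7, 11, 42]        # neurons chosen by an attribution-based method
top_activating = [3, 7, 19, 42]   # neurons with the highest mean activation

overlap = neuron_iou(cla_units, top_activating)
print(overlap)  # 3 shared units out of 5 distinct units
```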
