Automatic Discovery of Visual Circuits

Achyuta Rajaram, Neil Chowdhury, Antonio Torralba, Jacob Andreas, Sarah Schwettmann

TL;DR

This work introduces Cross-Layer Attribution (CLA), a scalable method for automatically discovering visual circuits in deep networks by constructing a subgraph with $k$ neurons per layer that maximizes cross-layer attribution, defined as $Score = |a_m| \cdot |\partial a_n / \partial a_m|$. CLA builds and refines circuits layer-by-layer and enables causal intervention via edge pruning and circuit pruning to test the circuit's effect on outputs. Applying CLA to Inception-CatFish reveals intermediate concept circuits that causally influence predictions and shows that CatFish outputs are composed from reusable concept circuits. The approach extends to defending multimodal models like CLIP by pruning a text-detection circuit, significantly improving robustness to text-based adversarial attacks and demonstrating practical utility for model editing and defense.
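The attribution score above can be illustrated on a toy example. The following is a minimal sketch, not the authors' implementation: it assumes a single fully connected ReLU layer, $a_n = \mathrm{ReLU}(\sum_m w_{nm} a_m + b_n)$, so that the gradient $\partial a_n / \partial a_m$ has a closed form. The function names `cla_scores` and `top_k_sources`, and the rule of summing scores over the already-selected later-layer neurons before taking the top $k$, are hypothetical choices made for illustration only.

```python
from typing import List


def cla_scores(a_prev: List[float],
               weights: List[List[float]],
               bias: List[float]) -> List[List[float]]:
    """Return scores[n][m] = |a_m| * |d a_n / d a_m| for a toy ReLU layer.

    Assumption: a_n = relu(sum_m weights[n][m] * a_prev[m] + bias[n]),
    so d a_n / d a_m equals weights[n][m] when the pre-activation is
    positive and 0 otherwise.
    """
    scores = []
    for w_row, b in zip(weights, bias):
        pre_activation = sum(w * a for w, a in zip(w_row, a_prev)) + b
        gate = 1.0 if pre_activation > 0.0 else 0.0  # ReLU derivative
        scores.append([abs(a) * abs(w * gate) for w, a in zip(w_row, a_prev)])
    return scores


def top_k_sources(scores: List[List[float]],
                  circuit_next: List[int],
                  k: int) -> List[int]:
    """Pick the k earlier-layer neurons with the largest total score.

    Assumption (illustrative only): scores are summed over the later-layer
    neurons already in the circuit, then the top k sources are kept.
    """
    n_prev = len(scores[0])
    totals = [sum(scores[n][m] for n in circuit_next) for m in range(n_prev)]
    ranked = sorted(range(n_prev), key=lambda m: totals[m], reverse=True)
    return sorted(ranked[:k])


# Toy data: three earlier-layer activations feeding two later-layer neurons.
a_prev = [1.0, 0.0, 2.0]
weights = [[0.5, 3.0, -1.0],
           [1.0, 1.0, 1.0]]
bias = [2.0, -5.0]

scores = cla_scores(a_prev, weights, bias)
selected = top_k_sources(scores, circuit_next=[0], k=2)
print(scores, selected)
```

In this toy case the earlier-layer neuron with the largest weight (index 1) receives a score of zero because its activation is zero, and the second later-layer neuron contributes nothing because its ReLU is inactive. The score therefore depends on both the activation and the gradient, not on weight magnitude alone.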

Abstract

To date, most discoveries of network subcomponents that implement human-interpretable computations in deep vision models have involved close study of single units and large amounts of human labor. We explore scalable methods for extracting the subgraph of a vision model's computational graph that underlies recognition of a specific visual concept. We introduce a new method for identifying these subgraphs: specifying a visual concept using a few examples, and then tracing the interdependence of neuron activations across layers, or their functional connectivity. We find that our approach extracts circuits that causally affect model output, and that editing these circuits can defend large pretrained models from adversarial attacks.

Paper Structure (21 sections, 11 figures, 3 algorithms)

Figures (11)

  • Figure 1: Automatic discovery of the car circuit inside Inception using CLA. The $\star$ indicates that a unit is present in the weight-based circuit discovered by Olah et al. (olah2020zoom). Maximally activating dataset exemplars are shown for each neuron. CLA recovers all units in the car circuit reported in olah2020zoom (units 491, 237, 373 in Layer 4b; unit 447 in Layer 4c), as well as additional car-detecting neurons in all three layers studied. Edge thickness is proportional to Cross-Layer Attribution score.
  • Figure 2: CatFish dataset examples. Each CatFish class composes images from two ImageNet classes. CatFish contains 20 composed classes sampled from 10 ImageNet classes.
  • Figure 3: Intervening on feature composition in Inception-CatFish by edge pruning. (a) CLA-generated circuits for the tabby-joystick (red) and tabby-pajama (blue) output classes. Neurons in both circuits (purple) correspond to the shared tabby intermediate concept (see Figure \ref{fig:sharedneurons} for visualization). (b) IoU between the neurons selected using CLA and the maximally activated neurons for a given concept. (c) Predicted impact of intermediate concept knockout. Eliminating a concept through edge pruning will only affect class outputs containing that concept, with no effect on recognition of other concepts. (d) Model accuracy on positive CatFish classes. (e) Model accuracy on negative classes. (f) Inception-CatFish prediction logits before and after pruning each concept circuit, on images sampled from a class containing the concept.
  • Figure 4: Samples from the Traffic Light dataset. The dataset includes real traffic light images, ImageNet with overlaid traffic light text, and adversarial text-attacked images.
  • Figure 5: Intervening on CLIP to prevent text-based adversarial attacks. (a) Schematic of the intervention showing that pruning the text circuit defends CLIP from a real-world adversarial attack. (b) Adversarial accuracy as a function of circuit width when intervening on three CLIP layers. (c) CLA text circuit for layer 3. Residual connections between equivalent channels are shown in black.
  • ...and 6 more figures