From Feature Visualization to Visual Circuits: Effect of Adversarial Model Manipulation
Geraldin Nanfack, Michael Eickenberg, Eugene Belilovsky
TL;DR
This paper studies the stability of mechanistic interpretability methods for vision models under adversarial model manipulation. It introduces ProxPulse, an attack that simultaneously fools natural and synthetic feature visualizations, and observes that visual circuits remain partially robust under this attack. To target circuits directly, the authors propose CircuitBreaker, a loss that preserves the functionality of the circuit head while degrading its top-attribution channels, and report metrics showing substantial circuit distortion without accuracy loss. The findings show that although some circuits are robust to visualization manipulation, circuit-level interpretations can still be manipulated by targeted attacks, which underscores the need for defenses in circuit discovery and interpretability. The results have implications for deploying interpretable AI in safety-critical settings and motivate further work on defenses against interpretability attacks.
Abstract
Understanding the inner workings of large-scale deep neural networks is challenging yet crucial in several high-stakes applications. Mechanistic interpretability is an emerging field that tackles this challenge, often by identifying human-understandable subgraphs in deep neural networks known as circuits. In vision-pretrained models, these subgraphs are usually interpreted by visualizing their node features through a popular technique called feature visualization. Recent works have analyzed the stability of different feature visualization types under the adversarial model manipulation framework. This paper first addresses limitations of existing works by proposing a novel attack called ProxPulse that simultaneously manipulates the two types of feature visualizations. Surprisingly, when analyzing these attacks under the umbrella of visual circuits, we find that visual circuits show some robustness to ProxPulse. We therefore introduce a new attack based on ProxPulse that unveils the manipulability of visual circuits, shedding light on their lack of robustness. The effectiveness of these attacks is validated using pre-trained AlexNet and ResNet-50 models on ImageNet.
