From Feature Visualization to Visual Circuits: Effect of Adversarial Model Manipulation

Geraldin Nanfack; Michael Eickenberg; Eugene Belilovsky

TL;DR

This paper studies the stability of mechanistic interpretability methods for vision models under adversarial model manipulation. It introduces ProxPulse, an attack that simultaneously fools natural and synthetic feature visualizations, and observes partial robustness of visual circuits under this attack. To directly target circuits, the authors propose CircuitBreaker, a loss that preserves circuit head functionality while degrading top-attribution channels, and validate it with strong metrics that show substantial circuit distortion without accuracy loss. The findings reveal that while some circuits exhibit robustness to visualization manipulation, circuit-level interpretations can be manipulated under targeted attacks, underscoring the need for defense strategies in robust circuit discovery and interpretability. The results have implications for deploying interpretable AI in safety-critical settings and motivate further work on defense mechanisms against interpretability attacks.

Abstract

Understanding the inner workings of large-scale deep neural networks is challenging yet crucial in several high-stakes applications. Mechanistic interpretability is an emergent field that tackles this challenge, often by identifying human-understandable subgraphs in deep neural networks known as circuits. In vision-pretrained models, these subgraphs are usually interpreted by visualizing their node features through a popular technique called feature visualization. Recent works have analyzed the stability of different feature visualization types under the adversarial model manipulation framework. This paper starts by addressing limitations in existing works by proposing a novel attack called ProxPulse that simultaneously manipulates the two types of feature visualizations. Surprisingly, when analyzing these attacks under the umbrella of visual circuits, we find that visual circuits show some robustness to ProxPulse. We therefore introduce a new attack based on ProxPulse that unveils the manipulability of visual circuits, shedding light on their lack of robustness. The effectiveness of these attacks is validated using pre-trained AlexNet and ResNet-50 models on ImageNet.

Paper Structure (25 sections, 9 equations, 35 figures, 2 tables)

Introduction
Related Work
Mechanistic Interpretability
Notations and Background
Methods
Manipulation of Feature Visualization
Manipulation of Visual Circuits
Experimental Evaluation
ProxPulse Simultaneously Fools Natural and Synthetic Feature Visualization
ProxPulse Has a Minor Effect on Channel Attribution Ranks of Visual Circuits
Manipulation of the Circuit through CircuitBreaker
Conclusion and Limitations
Appendix / supplemental material
Broader Impact
Further Experimental Details
...and 10 more sections

Figures (35)

Figure 1: Illustration of the manipulability of both natural and synthetic feature visualization using ProxPulse on conv5 and conv4 of AlexNet. The first row (resp. second row) shows the initial (resp. final) natural feature visualizations and the initial (resp. final) synthetic feature visualizations. In each image title, we report the corresponding metrics used to evaluate the change in top-activating inputs. Both natural and synthetic feature visualizations have completely changed, with the synthetic ones changing to very similar images. As intended, the conv4 synthetic images differ from those of conv5, although the same target images were used for $\mathcal{D}_{\text{fool}}$.
Figure 2: Histogram of pairwise cosine similarities between CLIP features of non-noisy synthetic images before (initial) and after (final) ProxPulse.
Figure 3: Illustration of the ineffectiveness of ProxPulse at manipulating the circuit. We show two visual circuits drawn for circuit head conv5:37, on pre-trained AlexNet (left) and on AlexNet fine-tuned with ProxPulse on conv5 (right). We observe that most of the channels on the circuit (at least two per layer, see the surrounded ones) were not removed by ProxPulse, even though some of them (e.g., channel conv5:151) have visually changed.
Figure 4: Visual circuit with sparsity 0.3 for conv5:37 after fine-tuning with ProxPulse on AlexNet. We observe that the final synthetic feature visualization of the circuit head with sparsity 0.3 is similar to the initial one in Fig. 3, although with sparsity 1 this final visualization was completely different from the initial one. Reducing the sparsity has therefore removed the change in feature visualization, as can be seen by the absence of the patterns added by ProxPulse in the right circuit of Fig. 3.
Figure 5: Effectiveness of CircuitBreaker to manipulate visual circuits on conv5 of AlexNet. We observe that the circuit visualization is severely distorted while the network outputs change minimally.
...and 30 more figures
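
The quantity plotted in Figure 2, a histogram of pairwise cosine similarities between feature vectors, can be sketched as follows. This is a minimal illustration and not the authors' code: the toy three-dimensional vectors stand in for CLIP features (which are not computed here), the function names are hypothetical, and the toy values were chosen only to mirror the qualitative observation in the Figure 1 caption that the final synthetic images look alike. They are not results from the paper.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_cosine_similarities(features):
    """Cosine similarity for every unordered pair of feature vectors."""
    return [cosine_similarity(u, v) for u, v in combinations(features, 2)]


def histogram(values, n_bins=10, lo=-1.0, hi=1.0):
    """Count values in n_bins equal-width bins over [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        # Clamp so that values at (or marginally beyond) the edges
        # still land in the first or last bin.
        idx = max(0, min(int((v - lo) / width), n_bins - 1))
        counts[idx] += 1
    return counts


# Hypothetical toy embeddings standing in for CLIP features of
# synthetic images before (initial) and after (final) an attack.
initial = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
final = [[1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [0.9, 0.1, 0.1]]

sims_initial = pairwise_cosine_similarities(initial)
sims_final = pairwise_cosine_similarities(final)

print("initial histogram:", histogram(sims_initial))
print("final histogram:  ", histogram(sims_final))
```

With real data, each list entry would be the embedding of one synthetic feature visualization, and a shift of the histogram mass toward 1 would indicate that the images have become more alike.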
