Manipulating Feature Visualizations with Gradient Slingshots

Dilyara Bareeva, Marina M. -C. Höhne, Alexander Warnecke, Lukas Pirch, Klaus-Robert Müller, Konrad Rieck, Sebastian Lapuschkin, Kirill Bykov

TL;DR

This work reveals a vulnerability in Feature Visualization (FV) by introducing Gradient Slingshots (GS), a fine-tuning-based method that can steer Activation Maximization (AM) toward arbitrary target images without altering the model architecture or drastically affecting performance. The authors formulate a theoretical framework using slingshot, landing, and tunnel regions to guarantee convergence of the FV optimization to a predefined target, and implement GS with manipulation and preservation losses to control the trajectory while maintaining internal representations. Experiments across CNNs and Vision Transformers, including a case study on weapon detection with CLIP, demonstrate that GS can fabricate faithful-looking FV explanations for arbitrary targets, potentially deceiving auditors. The paper also proposes a simple defense based on analyzing natural AM signals to detect manipulated FVs and discusses limitations and broader societal considerations. This work underscores the need for robust FV validation and more resilient interpretability techniques in high-stakes AI systems.

Abstract

Feature Visualization (FV) is a widely used technique for interpreting concepts learned by Deep Neural Networks (DNNs), which synthesizes input patterns that maximally activate a given feature. Despite its popularity, the trustworthiness of FV explanations has received limited attention. We introduce Gradient Slingshots, a novel method that enables FV manipulation without modifying model architecture or significantly degrading performance. By shaping new trajectories in off-distribution regions of a feature's activation landscape, we coerce the optimization process to converge to a predefined visualization. We evaluate our approach on several DNN architectures, demonstrating its ability to replace faithful FVs with arbitrary targets. These results expose a critical vulnerability: auditors relying solely on FV may accept entirely fabricated explanations. To mitigate this risk, we propose a straightforward defense and quantitatively demonstrate its effectiveness.
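
To make "synthesizes input patterns that maximally activate a given feature" concrete, here is a minimal sketch of activation maximization as plain gradient ascent on the input. It is an illustration only, not the paper's implementation: the toy `feature` function (a smooth bump standing in for a neuron), its analytic gradient, the step size, and the step count are all assumptions chosen so the example runs without any deep learning library.

```python
import math

def feature(x):
    """Toy 'feature' (hypothetical stand-in for a neuron): a bump peaking at (1, -2)."""
    return math.exp(-((x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2))

def feature_grad(x):
    """Analytic gradient of `feature` with respect to the input x."""
    f = feature(x)
    return [-2.0 * (x[0] - 1.0) * f, -2.0 * (x[1] + 2.0) * f]

def activation_maximization(x0, step_size=0.5, n_steps=200):
    """Gradient ascent on the input: x <- x + step_size * grad f(x)."""
    x = list(x0)
    for _ in range(n_steps):
        g = feature_grad(x)
        x = [xi + step_size * gi for xi, gi in zip(x, g)]
    return x

# Start from an arbitrary point and climb the activation landscape.
x_start = [0.0, 0.0]
x_star = activation_maximization(x_start)
```

In this toy setting the optimizer simply follows the gradient of the activation landscape from its starting point. The attack described in the abstract exploits exactly that dependence: if the landscape is reshaped in the off-distribution region where the optimization starts, the endpoint changes with it.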

Paper Structure

This paper contains 54 sections, 1 theorem, 13 equations, 20 figures, and 21 tables.

Key Result

Lemma 3.1

Assuming that $\bm{q}^{(0)} \in \mathbb{B}$, the FV optimization sequence $\bm{q}^{(i)}$ (eq: fv) converges to the target point $\bm{q^t}$, i.e., $\lim_{i \to \infty} \bm{q}^{(i)} = \bm{q^t}$, when $\mathbb{M} = \mathbb{T}_{B, L}$, the step size $\epsilon < \frac{1}{\gamma}$, and $r=\mathrm{id}$, i.e., ...
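
The step-size condition can be checked numerically under a simplifying assumption. Suppose that inside the tunnel the manipulated feature is the parabola $g(\bm{q}) = -\frac{\gamma}{2}\lVert \bm{q} - \bm{q^t} \rVert^2$ (an assumed form, suggested by the "parabolic form" mentioned in the Figure 2 caption, not copied from the paper) and that the FV update is plain gradient ascent with $r = \mathrm{id}$. Then $\bm{q}^{(i+1)} - \bm{q^t} = (1 - \epsilon\gamma)(\bm{q}^{(i)} - \bm{q^t})$, so for $\epsilon < \frac{1}{\gamma}$ the distance to the target shrinks by the constant factor $1 - \epsilon\gamma \in (0, 1)$ at every step. The sketch below simulates this; the function names, starting point, target, and values of `gamma` and `step_size` are hypothetical.

```python
def grad_tunnel(q, q_target, gamma):
    """Gradient of the assumed parabola g(q) = -gamma/2 * ||q - q_target||^2."""
    return [gamma * (t - qi) for qi, t in zip(q, q_target)]

def fv_sequence(q0, q_target, gamma, step_size, n_steps):
    """Iterate q <- q + step_size * grad g(q); record the distance to the target."""
    q = list(q0)
    distances = []
    for _ in range(n_steps):
        g = grad_tunnel(q, q_target, gamma)
        q = [qi + step_size * gi for qi, gi in zip(q, g)]
        dist = sum((qi - t) ** 2 for qi, t in zip(q, q_target)) ** 0.5
        distances.append(dist)
    return q, distances

gamma = 4.0
step_size = 0.2  # satisfies step_size < 1 / gamma = 0.25
q_target = [0.5, 0.5]
q_final, distances = fv_sequence([3.0, -1.0], q_target, gamma, step_size, n_steps=50)
```

With these values the contraction factor is $1 - \epsilon\gamma = 0.2$, so successive entries of `distances` shrink geometrically and `q_final` ends up at the target up to floating-point precision.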

Figures (20)

  • Figure 1: The Gradient Slingshots method manipulates the visualization for a given feature. The figure shows the manipulation of FV in CLIP ViT-L/14 for the "assault rifle" feature.
  • Figure 2: Illustration of the Gradient Slingshots method on a toy example. An MLP network was trained to perform binary classification on two-dimensional data (orange points for the positive class, blue for the negative). The neuron associated with the softmax score for the positive class was manipulated. The panels, from left to right: A) the activation landscape of the original neuron, with designated points $\bm{\tilde{q}}$ and $\bm{q^t}$, B) "slingshot", "landing" and "tunnel" zones, C) the activation landscape after manipulation, including a cross-section plane between the two points. The manipulated function in the "tunnel" zone exhibits a parabolic form (as in eq: parabola).
  • Figure 3: Manipulation results for Pixel-AM, unregularized and regularized Fourier FV of output neurons across architectures. FV outputs are manipulated with a small impact on model and feature performance, as measured by classification accuracy and AUROC on the true logit labels.
  • Figure 4: Sample FVs and their similarity to the target image (a Dalmatian) at different values of $\alpha$ for ResNet-50. Both very low and very high values of $\alpha$ result in low similarity to the target.
  • Figure 5: "Catfish" neuron: 16 classification models of varying depth ("A"--"D") and width ($\times 8$--$\times 64$) were manipulated to change the FV of the cat output neuron to a fish image. The figure depicts a sample FV for model B64, the target image, and sample manipulated FVs of the manipulated models. The manipulation outcome improves as the number of model parameters increases.
  • ...and 15 more figures

Theorems & Definitions (2)

  • Lemma 3.1
  • Proof of Lemma 3.1