Manipulating Feature Visualizations with Gradient Slingshots
Dilyara Bareeva, Marina M.-C. Höhne, Alexander Warnecke, Lukas Pirch, Klaus-Robert Müller, Konrad Rieck, Sebastian Lapuschkin, Kirill Bykov
TL;DR
This work reveals a vulnerability in Feature Visualization (FV) by introducing Gradient Slingshots (GS), a fine-tuning-based method that can steer Activation Maximization (AM) toward arbitrary target images without altering the model architecture or substantially degrading performance. The authors formulate a theoretical framework based on slingshot, landing, and tunnel regions to guarantee that the FV optimization converges to a predefined target, and they implement GS with a manipulation loss and a preservation loss that control the optimization trajectory while maintaining internal representations. Experiments across CNNs and Vision Transformers, including a case study on weapon detection with CLIP, demonstrate that GS can fabricate faithful-looking FV explanations for arbitrary targets, potentially deceiving auditors. The paper also proposes a simple defense that detects manipulated FVs by analyzing natural AM signals, and it discusses limitations and broader societal considerations. The work underscores the need for robust FV validation and more resilient interpretability techniques in high-stakes AI systems.
Abstract
Feature Visualization (FV) is a widely used technique for interpreting concepts learned by Deep Neural Networks (DNNs), which synthesizes input patterns that maximally activate a given feature. Despite its popularity, the trustworthiness of FV explanations has received limited attention. We introduce Gradient Slingshots, a novel method that enables FV manipulation without modifying model architecture or significantly degrading performance. By shaping new trajectories in off-distribution regions of a feature's activation landscape, we coerce the optimization process to converge to a predefined visualization. We evaluate our approach on several DNN architectures, demonstrating its ability to replace faithful FVs with arbitrary targets. These results expose a critical vulnerability: auditors relying solely on FV may accept entirely fabricated explanations. To mitigate this risk, we propose a straightforward defense and quantitatively demonstrate its effectiveness.
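To make the core idea concrete, the following is a minimal one-dimensional sketch, not the authors' implementation. It assumes a toy scalar "feature" with a known gradient, a hypothetical on-distribution interval [1, 3], a hypothetical target at -3, and a fixed off-distribution starting point for Activation Maximization. The manipulated feature is defined piecewise (and is discontinuous at the interval boundary) purely for illustration; Gradient Slingshots obtains its manipulated model by fine-tuning rather than by editing the function by hand. All names and constants below are assumptions introduced for this example.

```python
# Toy illustration: reshaping a feature's activation landscape off-distribution
# changes where gradient-ascent Activation Maximization (AM) converges,
# while on-distribution activations stay identical.

TARGET = -3.0                  # hypothetical manipulation target
DATA_LO, DATA_HI = 1.0, 3.0    # hypothetical on-distribution interval


def original_feature(x):
    """Toy faithful feature whose activation peaks at x = 2.0."""
    return -(x - 2.0) ** 2


def original_grad(x):
    """Analytic gradient of original_feature."""
    return -2.0 * (x - 2.0)


def manipulated_feature(x):
    """Matches the original on-distribution, points toward TARGET elsewhere."""
    if DATA_LO <= x <= DATA_HI:
        return original_feature(x)
    return -(x - TARGET) ** 2


def manipulated_grad(x):
    """Analytic gradient of manipulated_feature (piecewise)."""
    if DATA_LO <= x <= DATA_HI:
        return original_grad(x)
    return -2.0 * (x - TARGET)


def activation_maximization(grad_fn, x0, lr=0.1, steps=200):
    """Plain gradient ascent on the feature activation, starting from x0."""
    x = x0
    for _ in range(steps):
        x += lr * grad_fn(x)
    return x


if __name__ == "__main__":
    x0 = -0.5  # off-distribution initialization, as in typical AM procedures
    print("original AM result:   ", activation_maximization(original_grad, x0))
    print("manipulated AM result:", activation_maximization(manipulated_grad, x0))
    print("same activation at x=2.5:",
          original_feature(2.5) == manipulated_feature(2.5))
```

In this sketch both features agree on every input inside the on-distribution interval, yet gradient ascent from the same initialization ends at the faithful peak for the original feature and at the chosen target for the manipulated one. This mirrors, in a highly simplified form, why an auditor who inspects only the FV output could be misled.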
