Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed

Isha Gupta, Rylan Schaeffer, Joshua Kazdan, Ken Ziyu Liu, Sanmi Koyejo

TL;DR

The paper investigates why adversarial transfer differs between data-space and representation-space attacks. It introduces a mathematical model showing perfect transfer for data-space attacks between functionally equivalent networks, while representation-space attacks require stringent geometric alignment, with transfer vanishing in high dimensions otherwise. Empirically, data-space attacks transfer for image classifiers and vision-language models, whereas representation-space attacks rarely transfer unless latent geometries are closely aligned; text-based attacks readily transfer across language-model families, but soft-prompt attacks typically do not unless representations align. The work further demonstrates that, under geometric alignment, representation-space attacks can transfer in both language and vision-language models, highlighting latent-space structure as a key determinant. These insights have practical implications for designing robust multimodal systems and understanding when adversarial attacks may generalize across models.

Abstract

The field of adversarial robustness has long established that adversarial examples can successfully transfer between image classifiers and that text jailbreaks can successfully transfer between language models (LMs). However, a pair of recent studies reported being unable to successfully transfer image jailbreaks between vision-language models (VLMs). To explain this striking difference, we propose a fundamental distinction regarding the transferability of attacks against machine learning models: attacks in the input data-space can transfer, whereas attacks in model representation space do not, at least not without geometric alignment of representations. We then provide theoretical and empirical evidence of this hypothesis in four different settings. First, we mathematically prove this distinction in a simple setting where two networks compute the same input-output map but via different representations. Second, we construct representation-space attacks against image classifiers that are as successful as well-known data-space attacks, but fail to transfer. Third, we construct representation-space attacks against LMs that successfully jailbreak the attacked models but again fail to transfer. Fourth, we construct data-space attacks against VLMs that successfully transfer to new VLMs, and we show that representation space attacks can transfer when VLMs' latent geometries are sufficiently aligned in post-projector space. Our work reveals that adversarial transfer is not an inherent property of all attacks but contingent on their operational domain - the shared data-space versus models' unique representation spaces - a critical insight for building more robust models.
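
The first setting in the abstract (two networks that compute the same input-output map through different representations) can be illustrated with a small numerical sketch. This is not the paper's construction: it assumes two-layer linear networks whose hidden layers differ by a random orthogonal rotation `Q`. The names `transfer_demo`, `random_orthogonal`, and `rep_gain_ratio` are hypothetical, and the dimensions are arbitrary.

```python
import numpy as np


def random_orthogonal(d, rng):
    """Sample a d x d orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Flip column signs so the sample is uniform over orthogonal matrices.
    return q * np.sign(np.diag(r))


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def transfer_demo(d_in=16, d_hidden=512, d_out=10, seed=0):
    rng = np.random.default_rng(seed)

    # Network A (assumed linear for illustration): x -> W2 @ (W1 @ x)
    W1 = rng.standard_normal((d_hidden, d_in))
    W2 = rng.standard_normal((d_out, d_hidden))

    # Network B: the same input-output map, with the hidden layer rotated by Q.
    Q = random_orthogonal(d_hidden, rng)
    V1, V2 = Q @ W1, W2 @ Q.T

    x = rng.standard_normal(d_in)

    # Data-space perturbation: added to the input of both networks.
    dx = rng.standard_normal(d_in)
    effect_a = W2 @ (W1 @ (x + dx)) - W2 @ (W1 @ x)
    effect_b = V2 @ (V1 @ (x + dx)) - V2 @ (V1 @ x)

    # Representation-space perturbation: optimized in A's hidden space to push
    # A's output along `target`, then added unchanged to B's hidden layer.
    target = rng.standard_normal(d_out)
    dh = W2.T @ target              # gradient of (target . output) w.r.t. A's hidden state
    gain_a = target @ (W2 @ dh)     # push along target achieved on A (always positive)
    gain_b = target @ (V2 @ dh)     # push along target achieved on B

    return {
        "data_cosine": cosine(effect_a, effect_b),
        "rep_gain_ratio": float(gain_b / gain_a),
    }


if __name__ == "__main__":
    print(transfer_demo())
```

Under these assumptions, the data-space perturbation changes both networks' outputs identically (cosine similarity of 1 up to floating-point error), while the hidden-space perturbation keeps only a small fraction of its effect on the rotated network. That fraction shrinks on average as `d_hidden` grows, and setting `Q` to the identity restores full transfer, which mirrors the role of geometric alignment described above.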

Paper Structure

This paper contains 35 sections, 30 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed. Adversarial attacks can be applied to the input datum ("data-space attack") or to a network's representation of the input datum ("representation-space attack") (left). We hypothesize that this distinction explains why adversarial examples can transfer between image classifiers and why text jailbreaks can transfer between language models, but image jailbreaks were seemingly unable to transfer between vision-language models. Intuitively, networks trained on similar data and similar losses learn similar input-output maps, even though their representations may differ geometrically; consequently, data-space attacks cause similar harms against new models (center), whereas representation-space attacks are unlikely to transfer without additional forces to align representation spaces (right).
  • Figure 2: Image Classifiers: Data-Space Attacks Transfer, Representation-Space Attacks Do Not. ResNet18 image classifiers are trained on CIFAR10 to $\mathord{\sim}95\%$ classification accuracy. Universal attacks optimized on the raw input images have similar or slightly lower attack success rates (ASR) on transfer models than on the source models (left). In contrast, attacks optimized at any of the latent layers yield significantly reduced ASR on transfer models, e.g., Layer 1 (center) and Layer 5 (right). Representation attacks at Layer 1 achieve the highest transfer success (center).
  • Figure 3: Language Models: Representation-Space Attacks Do Not Transfer. We consider three sets of language models with the same hidden dimension. We observe that soft prompts optimized on one model and applied to another are overwhelmingly ineffective, and mostly do not provoke an increase in harmful output. We attack each model five times, optimizing and evaluating independently each time, visualized by separate markers. The attacked model is indicated in the label in the top right corner. Results for attacks on all 8 language models are provided in Fig. \ref{fig:softprompt-lms-full}.
  • Figure 4: Vision-Language Models: Data-Space Attacks Can Transfer. In contrast to the non-transferability of image jailbreaks between the Prismatic VLMs (Karamcheti et al., 2024), we create text jailbreaks that can successfully transfer between Prismatic VLMs. The attacked model pair is labeled in the bottom right. A key conceptual point is that, from the "perspective" of VLMs, text is the data-space, whereas image inputs are more akin to representation perturbations; this is more intuitively true in adapter-based VLMs such as LLaVA (Liu et al., 2023).
  • Figure 5: Language Models: Representation-Space Attacks Can Transfer Between Finetuned Variants of the Same Starting Model. We attack several of the finetuned checkpoints of Llama3 3B. As with the soft-prompt attacks on independent language models, attack success on the source model varies strongly with randomness. However, we observe consistently strong transfer, with many of the attacks achieving the same ASR on transfer models as on the source models. This holds both for models finetuned on other datasets and for models taken from different checkpoints of the same finetune. We provide additional results in Fig. \ref{fig:finetune-transfer-full}.
  • ...and 12 more figures