TRAIL: Transferable Robust Adversarial Images via Latent diffusion

Yuhao Xue, Zhifei Zhang, Xinyang Jiang, Yifei Shen, Junyao Gao, Wentao Gu, Jiale Zhao, Miaojing Shi, Cairong Zhao

TL;DR

TRAIL tackles the transferability gap in unrestricted adversarial attacks by adapting a latent diffusion model at test time, so that it synthesizes perturbations that follow the real-world image distribution $p(x)$ while embedding robust adversarial features $p(x+\delta)$. It optimizes an adversarial loss $\mathcal{L}_{adv}$ together with a distance loss $\mathcal{L}_{dis}$ during diffusion denoising, uses gradient guidance $\mathcal{G}_t$ to steer generation, and backpropagates through a single prediction step for efficiency. Empirically, TRAIL transfers across CNNs and ViTs better than the state-of-the-art methods it is compared against, bypasses common defenses, and enables black-box attacks on vision-language models. A theoretical proposition bounds latent perturbations under diffusion dynamics. The work highlights distribution-aligned adversarial feature synthesis as crucial for practical black-box attacks and introduces a new attack paradigm with potential security implications.

Abstract

Adversarial attacks exploiting unrestricted natural perturbations present severe security risks to deep learning systems, yet their transferability across models remains limited due to distribution mismatches between generated adversarial features and real-world data. While recent works utilize pre-trained diffusion models as adversarial priors, they are still hampered by the shift between the distribution of ideal adversarial samples and the natural image distribution learned by the diffusion model. To address this challenge, we propose Transferable Robust Adversarial Images via Latent Diffusion (TRAIL), a test-time adaptation framework that enables the model to generate images that carry adversarial features and closely resemble the target images. To mitigate the distribution shift, during attacks, TRAIL updates the diffusion U-Net's weights by combining adversarial objectives (to mislead victim models) and perceptual constraints (to preserve image realism). The adapted model then generates adversarial samples through iterative noise injection and denoising guided by these objectives. Experiments demonstrate that TRAIL significantly outperforms state-of-the-art methods in cross-model attack transferability, validating that distribution-aligned adversarial feature synthesis is critical for practical black-box attacks.
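As a rough illustration of how an adversarial objective and a distance constraint can be combined in a single update loop, the toy sketch below may help. It is not the paper's implementation. A small random linear classifier `W` stands in for the victim model, the input vector itself is updated instead of the diffusion U-Net's weights, and mean squared error stands in for the perceptual constraint. The weight `lam`, the step size, and all function names are assumptions introduced here for illustration.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over a 1-D logit vector.
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def combined_loss(x_adv, x_orig, W, label, lam=1.0):
    # Adversarial term (assumed form): log-probability of the true label.
    # Minimizing it pushes the toy classifier away from the correct class.
    l_adv = log_softmax(W @ x_adv)[label]
    # Distance term (assumed form): mean squared error to the original input.
    l_dis = np.mean((x_adv - x_orig) ** 2)
    return l_adv + lam * l_dis, l_adv, l_dis

def combined_grad(x_adv, x_orig, W, label, lam=1.0):
    probs = np.exp(log_softmax(W @ x_adv))
    onehot = np.zeros_like(probs)
    onehot[label] = 1.0
    # Gradient of log p(label | x) w.r.t. x for a linear classifier.
    g_adv = W.T @ (onehot - probs)
    # Gradient of the mean squared error term.
    g_dis = 2.0 * (x_adv - x_orig) / x_adv.size
    return g_adv + lam * g_dis

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))          # hypothetical 3-class surrogate model
x_orig = rng.normal(size=8)          # hypothetical "clean" input
label = int(np.argmax(W @ x_orig))   # treat the clean prediction as the true label

loss_before, logp_before, _ = combined_loss(x_orig, x_orig, W, label)

x_adv = x_orig.copy()
for _ in range(200):
    # Plain gradient descent on the combined objective.
    x_adv -= 0.05 * combined_grad(x_adv, x_orig, W, label)

loss_after, logp_after, dist_after = combined_loss(x_adv, x_orig, W, label)
print(f"combined loss: {loss_before:.4f} -> {loss_after:.4f}")
print(f"log p(true label): {logp_before:.4f} -> {logp_after:.4f}")
print(f"distance to original: {dist_after:.4f}")
```

The distance term is what keeps the update from drifting arbitrarily far from the original input. Raising `lam` in this toy trades attack strength for closeness, which mirrors the attack-versus-realism balance the abstract describes.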

Paper Structure

This paper contains 25 sections, 26 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of our approach. (a) We adapt the diffusion model with the optimization goal of generating more effective adversarial images while minimizing modifications to the original image. We use this objective to guide the adaptation and generation processes, enhancing the attack effectiveness of the generated images. (b) We present the complete process of generating an adversarial image. We employ test-time adaptation (TTA) by updating the diffusion model’s weights to better align with our adversarial objective. We introduce guidance in the denoising process to enhance the attack effectiveness of our generated image (Section \ref{subsec:diff-generation}). During TTA, we perform backpropagation for adaptation in a single prediction step (Section \ref{subsec:diffusion_fine_tuning}).
  • Figure 2: Compared to other methods, our approach produces more effective adversarial images with minimal alterations to the original images. The avg. ASR (%) represents the average attack success rate in a black-box setting. A higher value indicates stronger attack transferability.
  • Figure 3: Trade-off between attack performance and stealthiness depending on the choice of $t^*$.
  • Figure 4: The evaluation process on LLaVA. In the left column, LLaVA evaluates original images and correctly associates them with their labels. In the right column, LLaVA analyzes adversarial images generated by TRAIL on the surrogate model and is misled into rejecting the correct label.
  • Figure 5: We visualize adversarial examples generated under the same settings with different $t^*$ values. As $t^*$ increases, the overall image content remains unchanged, but the quality degrades due to the excessive injection of adversarial perturbations. (We selected some more noticeable examples to compare the effects of different $t^*$ values.)
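
The role of $t^*$ in Figures 3 and 5 can be made concrete with the standard DDPM forward process, $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$. The sketch below shows how much of the original signal survives noise injection at a few timesteps. The linear beta schedule, the 1000-step horizon, the sample $t^*$ values, and the function names are all assumptions made for illustration; this section does not state the schedule or the $t^*$ range TRAIL actually uses.

```python
import numpy as np

def alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    # Cumulative product of (1 - beta_t) for an assumed linear beta schedule.
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def inject_noise(x0, t_star, abar, rng):
    # Standard DDPM forward step: blend the clean input with Gaussian noise.
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(abar[t_star]) * x0 + np.sqrt(1.0 - abar[t_star]) * eps

abar = alpha_bar()
rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 4))  # hypothetical stand-in for a latent

signal_kept = {}
for t_star in (50, 200, 500):  # illustrative values only
    signal_kept[t_star] = float(np.sqrt(abar[t_star]))
    x_t = inject_noise(x0, t_star, abar, rng)
    print(f"t*={t_star}: signal scale={signal_kept[t_star]:.3f}, "
          f"noise scale={np.sqrt(1.0 - abar[t_star]):.3f}")
```

Under these assumptions, a larger $t^*$ keeps less of the original signal and leaves more room for the guided denoising to change the image. That is consistent with the stronger attacks and lower stealthiness the figures report as $t^*$ grows.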