LISArD: Learning Image Similarity to Defend Against Gray-box Adversarial Attacks

Joana C. Costa, Tiago Roxo, Hugo Proença, Pedro R. M. Inácio

TL;DR

This work tackles the realism gap in adversarial robustness by addressing gray-box attacks, where the attacker knows the architecture and training data but not model gradients. It introduces LISArD, a no-AT defense that jointly learns image similarity and classification by driving the cross-correlation matrix $M$ between clean and perturbed embeddings toward a diagonal identity, thereby reducing perturbation impact. The method leverages a combined loss ${\mathcal L} = \alpha ( {\mathcal L}_{C}+{\mathcal L}_{R}) + (1-\alpha) ( {\mathcal L}_{S}/\tau)$ with a gradually increasing $\alpha$ and a temperature $\tau$, training across multiple backbones and datasets without extra training cost. Empirical results show LISArD provides strong gray-box robustness across architectures and retains resilience in white-box settings, outperforming several Adversarial Distillation baselines, while offering practical benefits such as no additional models or adversarial samples during training. The work highlights the importance of image similarity objectives for robustness in realistic threat models and provides a publicly available implementation.
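The weighting in the combined loss can be illustrated with a small sketch. It treats ${\mathcal L}_{C}$, ${\mathcal L}_{R}$, and ${\mathcal L}_{S}$ as already-computed scalars and only shows how $\alpha$ and $\tau$ combine them. The linear ramp in `alpha_schedule` is an assumption made for illustration (the summary only says that $\alpha$ increases gradually), and both function names are hypothetical.

```python
def alpha_schedule(epoch: int, total_epochs: int) -> float:
    """Hypothetical linear ramp of alpha from 0 to 1 over training.

    Assumption: the source only states that alpha gradually increases;
    the actual schedule may differ.
    """
    if total_epochs <= 1:
        return 1.0
    return min(1.0, max(0.0, epoch / (total_epochs - 1)))


def combined_loss(l_c: float, l_r: float, l_s: float,
                  alpha: float, tau: float) -> float:
    """L = alpha * (L_C + L_R) + (1 - alpha) * (L_S / tau)."""
    return alpha * (l_c + l_r) + (1.0 - alpha) * (l_s / tau)


if __name__ == "__main__":
    # Early in training (small alpha) the similarity term dominates;
    # later (alpha near 1) the classification terms dominate.
    for epoch in (0, 5, 9):
        a = alpha_schedule(epoch, total_epochs=10)
        print(epoch, round(a, 3), combined_loss(1.0, 2.0, 8.0, a, tau=2.0))
```

With $\alpha = 0$ the loss reduces to ${\mathcal L}_{S}/\tau$, and with $\alpha = 1$ it reduces to ${\mathcal L}_{C}+{\mathcal L}_{R}$, so the schedule moves the emphasis from similarity learning to classification.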

Abstract

State-of-the-art defense mechanisms are typically evaluated in the context of white-box attacks, which is not realistic, as it assumes the attacker can access the gradients of the target network. To protect against this scenario, Adversarial Training (AT) and Adversarial Distillation (AD) include adversarial examples during the training phase, and Adversarial Purification uses a generative model to reconstruct all the images given to the classifier. This paper considers an even more realistic evaluation scenario: gray-box attacks, which assume that the attacker knows the architecture and the dataset used to train the target network, but cannot access its gradients. We provide empirical evidence that models are vulnerable to gray-box attacks and propose LISArD, a defense mechanism that does not increase computational and temporal costs but provides robustness against gray-box and white-box attacks without including AT. Our method approximates a cross-correlation matrix, created with the embeddings of perturbed and clean images, to a diagonal matrix while simultaneously conducting classification learning. Our results show that LISArD can effectively protect against gray-box attacks, can be used in multiple architectures, and carries over its resilience to the white-box scenario. Also, state-of-the-art AD models underperform greatly when removing AT and/or moving to gray-box settings, highlighting the lack of robustness from existing approaches to perform in various conditions (aside from white-box settings). All the source code is available at https://github.com/Joana-Cabral/LISArD.
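To make the cross-correlation idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that embeddings are standardized per dimension over the batch, that the matrix has size $E \times E$ (with $E$ the embedding size), and that the distance to a diagonal target is measured by squared error against the identity. The helper names and the `off_diag_weight` parameter are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def standardize(batch: Matrix, eps: float = 1e-12) -> Matrix:
    """Zero-mean, unit-variance normalization of each embedding dimension."""
    n, e = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(e)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in batch) / n) + eps
        for j in range(e)
    ]
    return [[(row[j] - means[j]) / stds[j] for j in range(e)] for row in batch]


def cross_correlation(clean: Matrix, perturbed: Matrix) -> Matrix:
    """E x E matrix M with M[i][j] = mean over the batch of clean_i * perturbed_j."""
    zc, zp = standardize(clean), standardize(perturbed)
    n, e = len(zc), len(zc[0])
    return [
        [sum(zc[b][i] * zp[b][j] for b in range(n)) / n for j in range(e)]
        for i in range(e)
    ]


def similarity_loss(m: Matrix, off_diag_weight: float = 1.0) -> float:
    """Assumed penalty: squared distance of M from the identity matrix."""
    e = len(m)
    on_diag = sum((m[i][i] - 1.0) ** 2 for i in range(e))
    off_diag = sum(m[i][j] ** 2 for i in range(e) for j in range(e) if i != j)
    return on_diag + off_diag_weight * off_diag


if __name__ == "__main__":
    clean = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 0.0]]
    noisy = [[1.1, 1.9], [2.2, 0.8], [2.9, 5.3], [4.1, 0.1]]
    print(similarity_loss(cross_correlation(clean, noisy)))
```

Diagonal entries near 1 mean each embedding dimension responds the same way to the clean and perturbed image. Penalizing the off-diagonal entries is a design choice of this sketch that discourages different dimensions from encoding redundant information.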

Paper Structure

This paper contains 14 sections, 6 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Comparison between the information available to an attacker when considering the different types of attacks. Image/Predictions Pairs refers to accessing only a set of images given to the model and the respective predictions, Data and Architecture refers to knowing the target model architecture and the dataset used to train it, and Model Gradients refers to controlling the model loss function.
  • Figure 2: Types of approaches commonly used to defend against adversarial attacks. The Teacher Model refers to a previously trained model, usually bigger than the Student Model, that aids the latter by providing soft labels. The DDPM refers to a Denoising Diffusion Probabilistic Model (a generative model) that adds noise to an image and then denoises it to produce a "purified" image.
  • Figure 3: Overview of the conversion from embeddings to a matrix in the Learning Image Similarity component. $E$ refers to the size of the embeddings, which varies depending on the selected model.
  • Figure 4: Overview of the LISArD architecture. The clean and noisy images are fed to the model, and the inner product is calculated using their respective embeddings. Both clean (orange) and noisy (green) embeddings are used to predict each class, and training uses an adaptively weighted loss that balances $\mathcal{L}_{C}$ and $\mathcal{L}_{R}$ against $\mathcal{L}_{S}$.
  • Figure 5: Comparison of the distributions for clean (blue) and attacked (red) images when considering a ResNet (left) and LISArD (right) for CIFAR-10. $\textit{d}'$ refers to the decidability measure, where values closer to 0 mean greater overlap between distributions.
  • ...and 2 more figures
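
The decidability measure $d'$ mentioned in the Figure 5 caption can be sketched as follows. This assumes the standard decidability index, $d' = |\mu_1 - \mu_2| / \sqrt{(\sigma_1^2 + \sigma_2^2)/2}$, which the excerpt itself does not spell out. The score lists are made-up values for illustration only, and the function name is hypothetical.

```python
import math
from statistics import fmean, pvariance
from typing import Sequence


def decidability(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Standard decidability index between two score distributions.

    Values near 0 indicate heavily overlapping distributions.
    Note: raises ZeroDivisionError if both distributions have zero variance.
    """
    mu_a, mu_b = fmean(scores_a), fmean(scores_b)
    var_a, var_b = pvariance(scores_a), pvariance(scores_b)
    return abs(mu_a - mu_b) / math.sqrt((var_a + var_b) / 2.0)


if __name__ == "__main__":
    # Hypothetical scores for clean and attacked images.
    clean_scores = [0.0, 2.0]
    attacked_scores = [4.0, 6.0]
    print(decidability(clean_scores, attacked_scores))
```

Under this definition, a model whose clean and attacked score distributions nearly coincide yields a $d'$ close to 0, consistent with the caption's reading that smaller values mean greater overlap.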