LISArD: Learning Image Similarity to Defend Against Gray-box Adversarial Attacks
Joana C. Costa, Tiago Roxo, Hugo Proença, Pedro R. M. Inácio
TL;DR
This work addresses a realism gap in adversarial robustness evaluation by focusing on gray-box attacks, where the attacker knows the architecture and training data of the target model but cannot access its gradients. It introduces LISArD, a defense that does not rely on Adversarial Training (AT) and jointly learns image similarity and classification by driving the cross-correlation matrix $M$ between clean and perturbed embeddings toward a diagonal matrix, thereby reducing the impact of perturbations. The method uses a combined loss $\mathcal{L} = \alpha(\mathcal{L}_{C} + \mathcal{L}_{R}) + (1-\alpha)(\mathcal{L}_{S}/\tau)$ with a gradually increasing weight $\alpha$ and a temperature $\tau$, and is trained across multiple backbones and datasets without extra training cost. Empirical results show that LISArD provides strong gray-box robustness across architectures and retains its resilience in white-box settings, outperforming several Adversarial Distillation baselines, while requiring no additional models or adversarial samples during training. The work highlights the importance of image similarity objectives for robustness under realistic threat models and provides a publicly available implementation.
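To make the weighting in the combined loss concrete, the following is a minimal sketch of how the three terms could be mixed. Only the formula itself comes from the summary above; the function names, the argument names, and the linear shape of the $\alpha$ schedule are assumptions made for illustration (the summary only states that $\alpha$ increases gradually).

```python
def combined_loss(loss_c, loss_r, loss_s, alpha, tau):
    """Evaluate L = alpha * (L_C + L_R) + (1 - alpha) * (L_S / tau).

    loss_c, loss_r, loss_s are the already-computed scalar values of
    L_C, L_R and L_S; how each one is obtained is outside this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if tau <= 0.0:
        raise ValueError("tau must be positive")
    return alpha * (loss_c + loss_r) + (1.0 - alpha) * (loss_s / tau)


def linear_alpha(epoch, total_epochs, start=0.0, end=1.0):
    """Hypothetical schedule: ramp alpha linearly from start to end.

    The linear shape and the default endpoints are assumptions.
    """
    if total_epochs <= 1:
        return end
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start + frac * (end - start)


if __name__ == "__main__":
    # Early epochs weight the similarity term; later epochs weight the rest.
    for epoch in (0, 5, 9):
        a = linear_alpha(epoch, total_epochs=10)
        print(epoch, a, combined_loss(1.0, 2.0, 4.0, alpha=a, tau=2.0))
```

Under this assumed schedule, a small $\alpha$ early in training makes the $\mathcal{L}_{S}/\tau$ term dominate, and the $\mathcal{L}_{C} + \mathcal{L}_{R}$ terms take over as $\alpha$ grows.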
Abstract
State-of-the-art defense mechanisms are typically evaluated against white-box attacks, an unrealistic setting, since it assumes the attacker can access the gradients of the target network. To protect against this scenario, Adversarial Training (AT) and Adversarial Distillation (AD) include adversarial examples during the training phase, and Adversarial Purification uses a generative model to reconstruct all the images given to the classifier. This paper considers a more realistic evaluation scenario: gray-box attacks, which assume that the attacker knows the architecture and the dataset used to train the target network, but cannot access its gradients. We provide empirical evidence that models are vulnerable to gray-box attacks and propose LISArD, a defense mechanism that does not increase computational and temporal costs but provides robustness against gray-box and white-box attacks without including AT. Our method drives a cross-correlation matrix, built from the embeddings of perturbed and clean images, toward a diagonal matrix while simultaneously learning to classify. Our results show that LISArD effectively protects against gray-box attacks, applies to multiple architectures, and carries its resilience over to the white-box scenario. In addition, state-of-the-art AD models underperform greatly when AT is removed and/or when they are moved to gray-box settings, which shows that existing approaches lack robustness outside white-box settings. All the source code is available at https://github.com/Joana-Cabral/LISArD.
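The cross-correlation idea described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the per-dimension standardization, the identity target on the diagonal, the squared-error penalty, and the names `standardize`, `cross_correlation`, `similarity_loss`, and `off_diag_weight` are all assumptions, borrowed from the general style of redundancy-reduction objectives such as Barlow Twins.

```python
import math


def standardize(batch):
    """Zero-mean, unit-variance per embedding dimension across the batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        std = std if std > 0.0 else 1.0  # guard against constant dimensions
        for b in range(n):
            out[b][j] = (col[b] - mean) / std
    return out


def cross_correlation(clean, perturbed):
    """D x D matrix M with M[i][j] = mean over the batch of zc[b][i] * zp[b][j]."""
    zc, zp = standardize(clean), standardize(perturbed)
    n, d = len(zc), len(zc[0])
    return [[sum(zc[b][i] * zp[b][j] for b in range(n)) / n
             for j in range(d)] for i in range(d)]


def similarity_loss(m, off_diag_weight=1.0):
    """Assumed penalty: pull diagonal entries to 1 and off-diagonal entries to 0."""
    d = len(m)
    on_diag = sum((1.0 - m[i][i]) ** 2 for i in range(d))
    off_diag = sum(m[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return on_diag + off_diag_weight * off_diag


if __name__ == "__main__":
    # Toy embeddings: 4 images, 2 dimensions; values are made up.
    clean = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 0.0]]
    perturbed = [[1.1, 1.9], [2.2, 1.2], [2.9, 4.8], [4.1, 0.3]]
    m = cross_correlation(clean, perturbed)
    print(m, similarity_loss(m))
```

In this sketch, the diagonal terms reward each embedding dimension for agreeing between the clean and perturbed views of the same image, while the off-diagonal terms discourage different dimensions from carrying the same information.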
