Self-Supervised Representation Learning for Adversarial Attack Detection
Yi Li, Plamen Angelov, Neeraj Suri
TL;DR
This paper tackles adversarial attack detection with a self-supervised framework that requires no labeled adversarial examples. It combines three components: pixel mapping with a loss $\mathcal{L}_{\text{PM}}$, prototype-wise contrastive estimation with a loss $\mathcal{L}_{\text{PCE}}$, and an instance-discrimination memory (the discrimination bank) trained via $\mathcal{L}_{\text{ICL}}$, all learned by a parallel axial-attention encoder (PAA-ResNet). On ImageNet, CIFAR-10, and COCO, the approach achieves state-of-the-art detection performance on unseen attacks, while parallelized attention and the training-time discrimination bank keep it efficient. The results indicate robust, transferable representations that reduce labeling needs and adapt to novel datasets and attack algorithms, which makes the method practical for secure AI systems.
Abstract
Supervised learning-based adversarial attack detection methods rely on a large amount of labeled data and suffer significant performance degradation when the trained model is applied to new domains. In this paper, we propose a self-supervised representation learning framework for the adversarial attack detection task to address this drawback. First, we map the pixels of augmented input images into an embedding space. Then, we employ a prototype-wise contrastive estimation loss to cluster prototypes as latent variables. Additionally, drawing inspiration from the concept of memory banks, we introduce a discrimination bank to distinguish and learn representations for each individual instance that shares the same or a similar prototype, establishing a connection between instances and their associated prototypes. We also propose a parallel axial-attention (PAA)-based encoder that facilitates training by processing the height and width axes of the attention maps in parallel. Experimental results show that, compared to various benchmark self-supervised vision learning models and supervised adversarial attack detection methods, the proposed model achieves state-of-the-art performance on the adversarial attack detection task across a wide range of images.
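
The prototype-level and instance-level objectives described above can be illustrated with a generic InfoNCE-style loss. The sketch below is not the paper's formulation: it assumes that both terms reduce to a softmax cross-entropy over cosine similarities, with cluster prototypes as the candidates in one case and discrimination-bank entries in the other. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(query, candidates, positive_index, temperature=0.1):
    """Generic InfoNCE-style loss: negative log-softmax of the positive candidate.

    query          -- embedding of one (augmented) input
    candidates     -- list of vectors (prototypes or bank entries)
    positive_index -- index of the candidate the query should match
    temperature    -- assumed softmax temperature (hypothetical value)
    """
    q = normalize(query)
    logits = [
        sum(a * b for a, b in zip(q, normalize(c))) / temperature
        for c in candidates
    ]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[positive_index]


# Toy data, purely illustrative.
embedding = [0.9, 0.1]
prototypes = [[1.0, 0.0], [0.0, 1.0]]          # assumed cluster centers
bank = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9]]    # assumed stored instance features

# Prototype-level term: pull the embedding toward its assigned prototype.
prototype_loss = info_nce(embedding, prototypes, positive_index=0)
# Instance-level term: pick out the embedding's own entry in the bank,
# even among entries that share a similar prototype.
instance_loss = info_nce(embedding, bank, positive_index=0)
total_loss = prototype_loss + instance_loss
```

In this toy setup the instance-level term is the harder one, because the second bank entry lies close to the query; that is the situation the discrimination bank is described as addressing, where instances share the same or a similar prototype but still need distinct representations.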
