Pulling Back the Curtain: Unsupervised Adversarial Detection via Contrastive Auxiliary Networks
Eylon Mizrahi, Raz Lapid, Moshe Sipper
TL;DR
This work tackles the vulnerability of deep models to adversarial perturbations by introducing U-CAN, an unsupervised adversarial detector that refines intermediate-layer features via lightweight Projection and ArcFace-based auxiliary blocks. The auxiliary blocks operate on a frozen backbone and, through layer-wise fusion, produce a compact detection vector without using adversarial examples for training. Empirical results across CIFAR-10, Mammals, and an ImageNet subset demonstrate that U-CAN improves F1 scores relative to existing unsupervised detectors, while maintaining low latency and modest memory overhead. The method is compatible with other feature-based detectors (e.g., DNR, DKNN) and strengthens robustness for safety-critical applications, with future work aiming to broaden task domains and attack analyses.
Abstract
Deep learning models are widely employed in safety-critical applications yet remain susceptible to adversarial attacks: imperceptible perturbations that can significantly degrade model performance. Conventional defense mechanisms predominantly focus on either enhancing model robustness or detecting adversarial inputs independently. In this work, we propose Unsupervised adversarial detection via Contrastive Auxiliary Networks (U-CAN), a method that uncovers adversarial behavior within auxiliary feature representations without the need for adversarial examples. U-CAN attaches auxiliary networks to selected intermediate layers of the target model. These auxiliary networks, comprising projection layers and ArcFace-based linear layers, refine feature representations to more effectively distinguish between benign and adversarial inputs. Comprehensive experiments across multiple datasets (CIFAR-10, Mammals, and a subset of ImageNet) and architectures (ResNet-50, VGG-16, and ViT) demonstrate that our method surpasses existing unsupervised adversarial detection techniques, achieving higher F1 scores against four distinct attack methods. The proposed framework provides a scalable and effective solution for enhancing the security and reliability of deep learning systems.
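To make the auxiliary-block idea concrete, the following is a minimal sketch of a single block: a linear projection followed by an ArcFace-style linear layer that scores an embedding by its cosine similarity to per-class weight vectors. It is an illustration under stated assumptions, not the U-CAN implementation. The function names, the scale of 30.0, the margin of 0.5, and the toy weights are all hypothetical, and the abstract does not specify how the projection is built or how per-layer outputs are fused.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def project(features, proj_weights):
    """Hypothetical projection layer: one output per row of proj_weights, no bias."""
    return [sum(w * x for w, x in zip(row, features)) for row in proj_weights]


def arcface_logits(embedding, class_weights, target=None, scale=30.0, margin=0.5):
    """ArcFace-style logits: scale * cos(theta_k), with an additive angular
    margin applied to the target class when a label is supplied (training).

    Simplification: the usual safeguard for theta + margin > pi is omitted.
    """
    z = l2_normalize(embedding)
    logits = []
    for k, w in enumerate(class_weights):
        w_hat = l2_normalize(w)
        cos_theta = sum(a * b for a, b in zip(z, w_hat))
        # Clamp to guard acos against floating-point drift outside [-1, 1].
        cos_theta = max(-1.0, min(1.0, cos_theta))
        if target is not None and k == target:
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    return logits


# Toy usage: a 3-dimensional "intermediate feature" from a frozen backbone
# (assumed values), projected to 2 dimensions and scored against 2 classes.
features = [1.0, 0.0, 2.0]
proj_weights = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
class_weights = [[1.0, 2.0], [2.0, -1.0]]

embedding = project(features, proj_weights)
inference_logits = arcface_logits(embedding, class_weights)
training_logits = arcface_logits(embedding, class_weights, target=0)
```

Because both the embedding and the class weights are normalized, the logits depend only on angles, and the margin pushes same-class embeddings to cluster more tightly around their class direction during training. That tighter clustering is the kind of feature refinement the abstract attributes to the ArcFace-based layers.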
