Pulling Back the Curtain: Unsupervised Adversarial Detection via Contrastive Auxiliary Networks

Eylon Mizrahi, Raz Lapid, Moshe Sipper

TL;DR

This work tackles the vulnerability of deep models to adversarial perturbations by introducing U-CAN, an unsupervised adversarial detector that refines intermediate-layer features via lightweight projection and ArcFace-based auxiliary blocks. The auxiliary blocks operate on a frozen backbone and, through layer-wise fusion, produce a compact detection vector without using adversarial examples for training. Empirical results across CIFAR-10, Mammals, and an ImageNet subset demonstrate that U-CAN improves F1 scores relative to existing unsupervised detectors while maintaining low latency and modest memory overhead. The method is compatible with other feature-based detectors (e.g., DNR, DKNN) and strengthens robustness for safety-critical applications; future work aims to broaden the task domains and attack analyses.

Abstract

Deep learning models are widely employed in safety-critical applications yet remain susceptible to adversarial attacks: imperceptible perturbations that can significantly degrade model performance. Conventional defense mechanisms predominantly focus on either enhancing model robustness or detecting adversarial inputs independently. In this work, we propose U-CAN (Unsupervised adversarial detection via Contrastive Auxiliary Networks), which uncovers adversarial behavior within auxiliary feature representations without the need for adversarial examples. U-CAN is embedded within selected intermediate layers of the target model. Its auxiliary networks, comprising projection layers and ArcFace-based linear layers, refine feature representations to more effectively distinguish between benign and adversarial inputs. Comprehensive experiments across multiple datasets (CIFAR-10, Mammals, and a subset of ImageNet) and architectures (ResNet-50, VGG-16, and ViT) demonstrate that our method surpasses existing unsupervised adversarial detection techniques, achieving superior F1 scores against four distinct attack methods. The proposed framework provides a scalable and effective solution for enhancing the security and reliability of deep learning systems.

Paper Structure

This paper contains 29 sections, 5 equations, 3 figures, 5 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Overview of our proposed method. The input $x$ passes through a frozen target model $\mathcal{M}$ with layers $\{L_1, L_2, \dots, L_N\}$, yielding features $\{\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_N\}$. Each $\mathbf{z}_k$ is fed to an Aux. Block $\mathcal{A}_k$ ($1{\times}1$ conv $\rightarrow$ adaptive-avg-pool $\rightarrow$ flatten $\rightarrow$ $\ell_2$-norm), producing refined vectors $\{\tilde{\mathbf{p}}_1, \tilde{\mathbf{p}}_2, \dots, \tilde{\mathbf{p}}_N\}$ that lie on unit hyperspheres anchored by the learnable ArcFace class centers $\mathbf{W}_k \in \mathbb{R}^{CL \times d'}$. Adversarial shifts (black stars) become more distinguishable from the well-separated benign clusters (green circles). An aggregator $\mathcal{G}$, applied to the $S$ most informative auxiliary blocks, combines their outputs into an adversarial detection vector $\mathbf{v}$.
  • Figure 2: Layer-wise t-SNE-reduced feature visualizations of ResNet-50 on the ImageNet validation set. Top: raw features $\{\mathbf{z}_n\}_{1}^{16}$. Bottom: U-CAN's contrastive features $\{\tilde{\mathbf{p}}_n\}_{1}^{16}$. From top-left ($L_0$) to bottom-right ($L_{16}$), each plot shows the benign class clusters (colors) and a single adversarial sample (black star). Without U-CAN, adversarial points blend in; with U-CAN, refined features sharpen class boundaries, exposing adversarial crossings.
  • Figure 3: Average precision-recall curves for each method across all datasets, models, attacks, and $\epsilon$ values. The thicker point marks the best F1; the transparent band shows the scaled variance.
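
The auxiliary block described in the Figure 1 caption ($1{\times}1$ conv, adaptive average pooling, flatten, $\ell_2$-norm, then comparison against ArcFace class centers) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names, the pooling target of $1{\times}1$, and the ArcFace `scale` and `margin` values are assumptions chosen for illustration, and the aggregator $\mathcal{G}$ is omitted because the text above does not specify it.

```python
import math
import random

def conv1x1(z, weight):
    """1x1 convolution. z: [C][H][W] feature map, weight: [D][C]. Returns [D][H][W]."""
    C, H, W = len(z), len(z[0]), len(z[0][0])
    return [[[sum(weight[d][c] * z[c][i][j] for c in range(C))
              for j in range(W)] for i in range(H)]
            for d in range(len(weight))]

def global_avg_pool(fmap):
    """Average pooling to 1x1 (assumed pooling target) plus flatten: [D][H][W] -> [D]."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]

def aux_block(z, weight):
    """Hypothetical auxiliary block: 1x1 conv -> avg-pool -> flatten -> l2-norm."""
    return l2_normalize(global_avg_pool(conv1x1(z, weight)))

def arcface_logits(p, centers, label=None, scale=30.0, margin=0.5):
    """Scaled cosine logits between p and each class center (centers: [classes][d']).

    If a label is given (training), the additive angular margin is applied to the
    target class: cos(theta + margin). scale and margin are illustrative values.
    This simple form omits the theta + margin > pi safeguard.
    """
    logits = []
    for c, w in enumerate(centers):
        cos = sum(a * b for a, b in zip(p, l2_normalize(w)))
        cos = max(-1.0, min(1.0, cos))  # guard acos against rounding error
        if label is not None and c == label:
            cos = math.cos(math.acos(cos) + margin)
        logits.append(scale * cos)
    return logits

# Usage on random data: an 8-channel 4x4 feature map projected to d' = 3.
rng = random.Random(0)
z_k = [[[rng.uniform(-1, 1) for _ in range(4)] for _ in range(4)] for _ in range(8)]
proj = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(3)]
W_k = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]  # 5 class centers
p_k = aux_block(z_k, proj)
print(p_k, arcface_logits(p_k, W_k))
```

Because both the refined vector and the class centers are normalized, each logit depends only on the angle between them, which is what makes the hypersphere picture in Figure 1 apply.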