Nearest Neighbor Projection Removal Adversarial Training

Himanshu Singh, A. V. Subramanyam, Shivank Rajput, Mohan Kankanhalli

TL;DR

This work addresses adversarial vulnerability stemming from inter-class feature overlap by introducing Nearest Neighbor Projection Removal Adversarial Training (nnPRAT). nnPRAT identifies the nearest inter-class neighbor in the feature space and removes the projection of both adversarial and clean features onto that neighbor, yielding a logits correction that contracts the last-layer Lipschitz constant and reduces Rademacher complexity. The method improves robustness consistently across CIFAR-10, CIFAR-100, and SVHN, and scales to larger architectures such as WRN-34-10 and to TinyImageNet while preserving clean accuracy. The results underscore the value of geometry-aware regularization in adversarial training and corroborate the theoretical link between feature-space disentanglement and improved generalization under attack.

Abstract

Deep neural networks have exhibited impressive performance in image classification tasks but remain vulnerable to adversarial examples. Standard adversarial training enhances robustness but typically fails to explicitly address inter-class feature overlap, a significant contributor to adversarial susceptibility. In this work, we introduce a novel adversarial training framework that actively mitigates inter-class proximity by projecting out inter-class dependencies from adversarial and clean samples in the feature space. Specifically, our approach first identifies the nearest inter-class neighbors for each adversarial sample and subsequently removes projections onto these neighbors to enforce stronger feature separability. Theoretically, we demonstrate that our proposed logits correction reduces the Lipschitz constant of neural networks, thereby lowering the Rademacher complexity, which directly contributes to improved generalization and robustness. Extensive experiments across standard benchmarks including CIFAR-10, CIFAR-100, and SVHN show that our method is competitive with leading adversarial training techniques in both robust and clean accuracy. Our findings reveal the importance of addressing inter-class feature proximity explicitly to bolster adversarial robustness in DNNs.
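
The two steps described above (find the nearest inter-class neighbor, then remove the projection onto it) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the Euclidean distance, the function names `nearest_interclass_neighbor` and `remove_projection`, and the `strength` parameter are all hypothetical choices made for this example.

```python
import numpy as np

def nearest_interclass_neighbor(features, labels):
    """Return, for each row, the index of the closest row with a different label.

    Assumption: Euclidean distance in feature space, and at least two
    classes present in the batch.
    """
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Mask out same-class pairs (this also masks each sample against itself).
    same_class = labels[:, None] == labels[None, :]
    dists[same_class] = np.inf
    return dists.argmin(axis=1)

def remove_projection(features, neighbors, strength=1.0, eps=1e-12):
    """Subtract from each feature its projection onto the paired neighbor.

    `strength` (hypothetical) scales how much of the projection is removed;
    strength=1.0 makes the result orthogonal to the neighbor.
    """
    dot = np.sum(features * neighbors, axis=1, keepdims=True)
    norm_sq = np.sum(neighbors * neighbors, axis=1, keepdims=True)
    return features - strength * (dot / (norm_sq + eps)) * neighbors

# Toy batch: four 2-D features, two classes.
features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.6]])
labels = np.array([0, 0, 1, 1])

idx = nearest_interclass_neighbor(features, labels)
corrected = remove_projection(features, features[idx])
```

With full removal (`strength=1.0`), each corrected feature has no component left along its nearest inter-class neighbor, which is the separation effect the abstract describes. In a training loop the same operation would be applied to both clean and adversarial features; how the corrected features enter the loss is not specified in this summary.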

Paper Structure

This paper contains 25 sections, 2 theorems, 13 equations, 4 figures, 3 tables, 1 algorithm.

Key Result

Lemma 1

Let $z$ and $\tilde{z}$ be the logits of a sample and of its nearest neighbor, respectively. Then the projection removal step induces a spectral norm contraction given by $\|W_r'\|_{\mathrm{op}} \le (1-\alpha)\,\|W_r\|_{\mathrm{op}}$, where $\alpha \in (0,1)$.
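
To see where a factor of $(1-\alpha)$ can come from, consider a simplified setting. It assumes that the removal step subtracts a fraction $\alpha$ of the rank-one projection onto $\tilde{z}$; the projector $P$ and corrected logits $z'$ below are notation introduced for this illustration. How this relates to $W_r$ and $W_r'$ depends on definitions in the paper that this summary does not reproduce.

$$
P = \frac{\tilde{z}\,\tilde{z}^{\top}}{\|\tilde{z}\|^{2}}, \qquad
z' = (I - \alpha P)\,z = (1-\alpha)\,Pz + (I-P)\,z
$$

$$
\|z'\|^{2} = (1-\alpha)^{2}\,\|Pz\|^{2} + \|(I-P)z\|^{2} \;\le\; \|z\|^{2}
$$

Under this assumption the component of $z$ along the neighbor direction shrinks by exactly $(1-\alpha)$, while the orthogonal component is unchanged, so the correction never increases the norm.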

Figures (4)

  • Figure 1: Visualization of the PCA-reduced feature space from an FGSM-trained MNIST model. The red digits indicate the query points, while the blue digits represent their top-10 nearest neighbors from various classes. Despite adversarial training, each query is surrounded mostly by off-class neighbors, indicating persistent inter-class entanglement in the representation.
  • Figure 2: Effect of projection removal in the two-dimensional feature space. (a) Input space depicting the decision boundaries. The solid line is the baseline classifier, and the dashed line is the classifier after projection removal training. Our method provides a wider, smoother margin. (b) Two-dimensional PCA projection of the penultimate-layer activations for the standard-trained model. (c) PCA projection of the same activations with projection removal training, exhibiting markedly tighter and more distinct class clusters.
  • Figure 3: Clean (circle) and robust (square) accuracy under different (a) $\lambda$ and (b) $\beta$ values. Shaded areas show the clean–robust gap.
  • Figure 4: t-SNE visualization of CIFAR-100 on ResNet18 without projection removal training (left) and with projection removal training (right).

Theorems & Definitions (2)

  • Lemma 1
  • Lemma 2