Adversarial Feature Alignment: Balancing Robustness and Accuracy in Deep Learning via Adversarial Training

Leo Hyun Park; Jaeuk Kim; Myung Gyo Oh; Jaewoo Park; Taekyoung Kwon

TL;DR

This work identifies misalignment in feature representations as a root cause of the robustness-accuracy tradeoff in deep nets and proposes Adversarial Feature Alignment (AFA), a robust pre-training method that uses a fully supervised contrastive loss within a min–max adversarial framework to align features. By targeting the penultimate layer and employing adversarially augmented positives/negatives, AFA achieves stronger robustness while maintaining high clean accuracy, outperforming prior adversarial training and adversarial contrastive learning methods. The authors further show that combining AFA with TRADES and diffusion-model–based data augmentation yields state-of-the-art results on CIFAR10/100 and competitive performance on larger datasets, supported by analyses of feature-space alignment and Lipschitzness. Overall, AFA advances practical robustness by aligning class manifolds in feature space, enabling more reliable generalization under adversarial perturbations and real-world distribution shifts.

Abstract

Deep learning models continue to advance in accuracy, yet they remain vulnerable to adversarial attacks, which often lead to the misclassification of adversarial examples. Adversarial training is used to mitigate this problem by increasing robustness against these attacks. However, this approach typically reduces a model's standard accuracy on clean, non-adversarial samples. The necessity for deep learning models to balance both robustness and accuracy for security is obvious, but achieving this balance remains challenging, and the underlying reasons are yet to be clarified. This paper proposes a novel adversarial training method called Adversarial Feature Alignment (AFA) to address these problems. Our research unveils an intriguing insight: misalignment within the feature space often leads to misclassification, regardless of whether the samples are benign or adversarial. AFA mitigates this risk by employing a novel optimization algorithm based on contrastive learning to alleviate potential feature misalignment. Through our evaluations, we demonstrate the superior performance of AFA. The baseline AFA delivers higher robust accuracy than previous adversarial contrastive learning methods while limiting the drop in clean accuracy, relative to cross-entropy training, to 1.86% on CIFAR10 and 8.91% on CIFAR100. We also show that joint optimization of AFA and TRADES, accompanied by data augmentation using a recent diffusion model, achieves state-of-the-art accuracy and robustness.
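
To make the contrastive-learning idea concrete, the following is a minimal NumPy sketch of a generic supervised contrastive loss, the family of objectives the TL;DR says AFA builds on. It is not AFA's actual loss: the function name `supcon_loss`, the `temperature` value, and the toy features are assumptions made for illustration, and the adversarial min-max step is omitted.

```python
import numpy as np

def supcon_loss(features, labels, temperature=0.1):
    """Generic supervised contrastive loss (illustrative; not the paper's AFA loss)."""
    # Compare L2-normalized features by temperature-scaled cosine similarity.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    # Exclude each anchor from its own softmax denominator.
    sim = np.where(not_self, sim, -np.inf)
    sim_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - sim_max - np.log(np.exp(sim - sim_max).sum(axis=1, keepdims=True))
    # Positives are the other samples that share the anchor's label.
    positives = (labels[:, None] == labels[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0  # skip anchors that have no positive in the batch
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())

# Toy features: two tight, mutually orthogonal clusters.
features = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
aligned_loss = supcon_loss(features, np.array([0, 0, 1, 1]))
misaligned_loss = supcon_loss(features, np.array([0, 1, 0, 1]))
```

In this toy case the loss is small when the labels agree with the clusters and large when they do not, which is the sense in which such a loss rewards class-aligned features.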

Paper Structure (42 sections, 2 theorems, 9 equations, 5 figures, 16 tables, 1 algorithm)

Introduction
Background
Preliminary Concept
Threat Model
Adversarial attack
Adversarial training
Robustness and Accuracy Need Alignment
Properties of Data Manifold
Separation Is Not Enough: Alignment Helps
Misalignment Problem in Feature Space
Feature Misalignment of Different Layers
Correlation between Misaligned Classes and Misclassified Classes
Adversarial Feature Alignment
Revisiting Contrastive Loss Function
Principles for Adversarial Feature Alignment
...and 27 more sections

Key Result

lemma 1

$f_\text{1-nn}$ has accuracy of 1 on the $R$-aligned data distribution.
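
The intuition behind this lemma can be checked on toy data. The sketch below assumes a simplified reading of alignment taken from the Figure 1(c) caption: every class lies within radius `radius` of its own center, and different classes are farther apart than twice that radius. This reading, the helper names, the L1 metric, and the chosen centers are illustrative assumptions, not the paper's formal definitions.

```python
import numpy as np

def one_nn_predict(train_x, train_y, test_x):
    """Give each test point the label of its nearest training point (L1 distance)."""
    dists = np.abs(test_x[:, None, :] - train_x[None, :, :]).sum(axis=2)
    return train_y[dists.argmin(axis=1)]

def sample_class(rng, center, radius, n):
    """Draw n points whose L1 distance to `center` is at most `radius`."""
    dim = center.shape[0]
    offsets = rng.uniform(-radius / dim, radius / dim, size=(n, dim))
    return center + offsets

rng = np.random.default_rng(0)
radius = 1.0
# Hypothetical class centers, 10 apart in L1 distance (well above 2 * radius).
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])

train_x = np.concatenate([sample_class(rng, c, radius, 20) for c in centers])
train_y = np.repeat(np.arange(3), 20)
test_x = np.concatenate([sample_class(rng, c, radius, 50) for c in centers])
test_y = np.repeat(np.arange(3), 50)

accuracy = float((one_nn_predict(train_x, train_y, test_x) == test_y).mean())
print(accuracy)
```

Here every same-class training point is within L1 distance 2 of a test point, while every other-class point is at least 8 away, so the nearest neighbor always carries the correct label and the accuracy is 1.0 for any seed.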

Figures (5)

Figure 1: Visual illustrations of data distribution manifolds with colors indicating sample labels. (a) Illustrates the robustness-accuracy tradeoff problem regarding standard clean accuracy and robust accuracy. (b) Shows misaligned distribution where test sample 'x' might differ in color from its nearest training sample 'o', despite large class distances. (c) Depicts aligned distribution satisfying both separation and clustering, ensuring test sample 'x' matches the color of training sample 'o' if the minimum class distance is at least twice the radius of 'o'.
Figure 2: Accuracy on clean samples (solid lines) and PGD adversarial examples (dashed lines) at each neural network layer under different training methods. Vertical dashed lines indicate the penultimate layer, and vertical solid lines indicate the logit layer. For layers before the logit layer (i.e., the last layer), the plotted value is the accuracy of $f_\text{1-nn}$ on that layer's output. The accuracy at the logit layer is identical to the classification accuracy of the network. PGD examples were generated from clean test samples with $\epsilon=8/255$, step size $\alpha=2/255$, and attack iteration $N=20$. We measured the $l_1$ distance between the layer output of a test sample and those of train samples.
Figure 3: Overview of Adversarial Feature Alignment (AFA): AFA incorporates inner and outer optimization steps in each training epoch. (a) In the inner step, an adversarial example $\tilde{x}+\delta$ is generated to maximize the AFA adversarial loss, optimizing feature vector distances from both positive and negative examples. (b) The feature extractor $g$ is then updated to minimize the feature vector distance from positive examples and maximize it from negatives. (c) Post-AFA training, class samples are efficiently clustered into their respective classes, optimizing both intra-cluster closeness and inter-cluster separation.
Figure 4: Robust accuracy of training methods against varying attack strength. (a) We set $\epsilon=8/255$ and $\alpha=2/255$ for the PGD attack while changing the number of attack iterations $N$. (b) We set the number of iterations $N=20$ and $\alpha=\epsilon/4$ while changing the perturbation size $\epsilon$. The target model is ResNet-18 and the dataset is CIFAR10.
Figure 5: t-SNE visualization of feature spaces represented by adversarial training methods on CIFAR10. While PGD and AdvCL result in widespread feature spaces, our adversarial feature alignment method separates classes more distinctly.
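
The PGD settings quoted in the captions above ($\epsilon=8/255$, $\alpha=2/255$, $N=20$) can be illustrated with a minimal NumPy sketch of an $l_\infty$ PGD attack. To stay self-contained it attacks a hypothetical linear softmax classifier rather than ResNet-18; the model, the function names, the input range $[0,1]$, and the absence of a random start are all assumptions, since the captions do not specify them.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def input_gradient(W, b, x, y):
    """Gradient of the cross-entropy loss w.r.t. x for logits = W @ x + b."""
    p = softmax(W @ x + b)
    p[y] -= 1.0
    return W.T @ p

def pgd_linf(W, b, x, y, eps=8 / 255, alpha=2 / 255, steps=20):
    """Iterated signed-gradient ascent, projected onto the eps-ball around x."""
    x_adv = x.copy()
    for _ in range(steps):
        grad = input_gradient(W, b, x_adv, y)
        x_adv = x_adv + alpha * np.sign(grad)     # ascent step on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back into the l_inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep a valid input range (assumed [0, 1])
    return x_adv

# Hypothetical toy model and input, for illustration only.
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 32))
b = np.zeros(10)
x = rng.uniform(0.0, 1.0, size=32)
y = 3
x_adv = pgd_linf(W, b, x, y)
max_perturbation = float(np.abs(x_adv - x).max())
```

The two clipping steps are what enforce the threat model: after every iteration the perturbation stays within $\epsilon$ in $l_\infty$ norm and the perturbed input stays in the assumed valid range.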

Theorems & Definitions (5)

definition 1: $r$-separation [yang2020closer]
definition 2: $R$-clustering
definition 3: $R$-alignment
lemma 1
theorem 1
