Conflict-Aware Adversarial Training

Zhiyu Xue, Haohan Wang, Yao Qin, Ramtin Pedarsani

TL;DR

A new trade-off paradigm for adversarial training with a conflict-aware factor for the convex combination of standard and adversarial loss, named Conflict-Aware Adversarial Training (CA-AT), which consistently offers a superior trade-off between standard performance and adversarial robustness.

Abstract

Adversarial training is the most effective method for obtaining adversarial robustness in deep neural networks, as it directly involves adversarial samples in the training procedure. To obtain a model that is both accurate and robust, the weighted-average method is applied to optimize the standard loss and the adversarial loss simultaneously. In this paper, we argue that the weighted-average method does not provide the best trade-off between standard performance and adversarial robustness. We attribute this failure to the conflict between the gradients derived from the standard and adversarial losses, and we further demonstrate, both theoretically and empirically, that this conflict increases with the attack budget. To alleviate this problem, we propose a new trade-off paradigm for adversarial training that uses a conflict-aware factor for the convex combination of standard and adversarial loss, named Conflict-Aware Adversarial Training (CA-AT). Comprehensive experimental results show that CA-AT consistently offers a superior trade-off between standard performance and adversarial robustness under both adversarial training from scratch and parameter-efficient finetuning.
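To make the weighted-average method concrete, here is a minimal NumPy sketch on a toy linear classifier. It is an illustration under stated assumptions, not the paper's implementation: the logistic model, the one-step sign attack, the function names, and the convention that `lam` weights the adversarial term are all chosen here for illustration.

```python
import numpy as np

def logistic_loss_and_grad(theta, x, y):
    """Mean logistic loss and its gradient w.r.t. theta; labels y are in {-1, +1}."""
    margins = y * (x @ theta)
    loss = np.mean(np.log1p(np.exp(-margins)))
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)); the chain rule adds the factor y * x
    coeff = -y / (1.0 + np.exp(margins))
    grad = (coeff[:, None] * x).mean(axis=0)
    return loss, grad

def sign_attack(theta, x, y, budget):
    """One-step L-infinity attack (assumed here): shift each input by `budget`
    along the sign of the loss gradient with respect to that input."""
    margins = y * (x @ theta)
    coeff = -y / (1.0 + np.exp(margins))
    grad_x = coeff[:, None] * theta[None, :]
    return x + budget * np.sign(grad_x)

def weighted_average_gradient(theta, x, y, budget, lam=0.5):
    """Gradient of (1 - lam) * standard loss + lam * adversarial loss.
    Which term lam multiplies is an assumption of this sketch."""
    _, g_c = logistic_loss_and_grad(theta, x, y)          # standard (clean) gradient
    x_adv = sign_attack(theta, x, y, budget)
    _, g_a = logistic_loss_and_grad(theta, x_adv, y)      # adversarial gradient
    return (1.0 - lam) * g_c + lam * g_a, g_c, g_a
```

With `lam=0.5` the update direction is the plain average of the two gradients, so whenever they point in conflicting directions part of each is cancelled in the combined step.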

Paper Structure

This paper contains 14 sections, 15 equations, 16 figures, 4 tables, 1 algorithm.

Figures (16)

  • Figure 1: The key motivation of CA-AT is to resolve the conflict between the clean gradient $g_{\text{c}}$ and the adversarial gradient $g_{\text{a}}$. Unlike the existing weighted-average method (Vanilla AT), which optimizes the model parameters $\theta$ with $g_{\circ}$, the average of $g_{\text{c}}$ and $g_{\text{a}}$, CA-AT optimizes them with $g_{*}$, obtained by gradient projection based on a new trade-off factor $\phi$. The bar chart on the right shows that the model optimized by $g_{*}$ (highlighted in boldface) achieves better standard accuracy (blue bar) and adversarial accuracy (red bar) than models optimized by $g_{\circ}$. These results are produced by training a ResNet18 on CIFAR10 against the PGD attack [madry2017PGD].
  • Figure 2: Experimental results of Vanilla AT with $\lambda=0.5$ for a binary classification task on our MNIST-crafted data. In panel (a), each subfigure is a t-SNE [hinton2002tsne] visualization of the distribution of adversarial gradients ($g_{\text{a}}$) and standard gradients ($g_{\text{c}}$) for various training samples at the final epoch, under different attack budgets ($\delta \in \{0.05, 0.1, 0.15, 0.2, 0.25, 0.3\}$). In panel (b), the upper bar chart shows the standard and adversarial accuracy on the test set for the same values of $\delta$ as in panel (a). The upper left line chart shows the relation between $\mu=\|g_{\text{c}}\|_{2} \cdot \|g_{\text{a}}\|_{2} \cdot (1-\cos(g_{\text{c}},g_{\text{a}}))$ and $\delta$, where the red line is the theoretical upper bound presented in Theorem 1. To decompose $\mu$, the lower bar chart shows the relation between $\delta$ and each of $\|g_{\text{c}}\|_{2}$, $\|g_{\text{a}}\|_{2}$, and $(1-\cos(g_{\text{c}},g_{\text{a}}))$.
  • Figure 3: Results of the gradient conflict metric $\mu$ on real-world datasets. The first panel shows $\mu$ across different real-world datasets (CIFAR10/CIFAR100) and model architectures (ResNet18/ResNet34), where the attack method used in AT is PGD. The second panel shows $\mu$ for different attack methods (AutoPGD/AutoPGD-DLR/T-AutoPGD-DLR) during AT, conducted on CIFAR10 with ResNet18.
  • Figure 4: SA-AA Fronts for Adversarial PEFT on Swin-T using Adapter.
  • Figure 5: SA-AA Fronts for Adversarial PEFT on ViT using Adapter on Stanford Dogs.
  • ...and 11 more figures
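
The gradient conflict metric $\mu$ defined in the Figure 2 caption can be computed directly from a pair of gradient vectors. The following is a minimal NumPy sketch; the function names and the `eps` guard against zero-norm gradients are assumptions made here, and the gradients are treated as flattened 1-D arrays.

```python
import numpy as np

def cosine_similarity(g_c, g_a, eps=1e-12):
    """cos(g_c, g_a) for two flattened gradient vectors.
    `eps` (an assumption of this sketch) avoids division by zero."""
    denom = np.linalg.norm(g_c) * np.linalg.norm(g_a)
    return float(np.dot(g_c, g_a) / max(denom, eps))

def gradient_conflict(g_c, g_a):
    """mu = ||g_c||_2 * ||g_a||_2 * (1 - cos(g_c, g_a)), as in the Figure 2 caption."""
    norm_c = np.linalg.norm(g_c)
    norm_a = np.linalg.norm(g_a)
    return float(norm_c * norm_a * (1.0 - cosine_similarity(g_c, g_a)))

# Three reference cases: aligned, orthogonal, and opposite gradients.
aligned = gradient_conflict(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = gradient_conflict(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = gradient_conflict(np.array([1.0, 0.0]), np.array([-2.0, 0.0]))
```

For nonzero gradients $\mu$ equals $\|g_{\text{c}}\|_{2}\|g_{\text{a}}\|_{2} - g_{\text{c}}^{\top} g_{\text{a}}$, so it is close to zero when the two gradients are aligned and reaches its maximum of $2\|g_{\text{c}}\|_{2}\|g_{\text{a}}\|_{2}$ when they point in opposite directions.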