Improving White-box Robustness of Pre-processing Defenses via Joint Adversarial Training

Dawei Zhou, Nannan Wang, Xinbo Gao, Bo Han, Jun Yu, Xiaoyu Wang, Tongliang Liu

TL;DR

This work tackles the robustness degradation of input pre-processing defenses under white-box adaptive attacks by introducing Joint Adversarial Training based Pre-processing (JATP). JATP trains the pre-processing module with full-model adversarial examples and optimizes a hybrid loss that combines pixel-level fidelity with feature-space adversarial risk, plus a misclassification-aware regularization to improve cross-model transferability. Empirical results on SVHN and CIFAR-10 show that JATP reduces the degradation of adversarial robustness across multiple target models and achieves superior protection against diverse adaptive attacks compared to prior defenses. The approach advances practical white-box robustness for denoising pre-processing steps and suggests avenues for extending joint training to broader pre-processing defenses.

Abstract

Deep neural networks (DNNs) are vulnerable to adversarial noise. A range of adversarial defense techniques have been proposed to mitigate the interference of adversarial noise, among which input pre-processing methods are scalable and show great potential to safeguard DNNs. However, pre-processing methods may suffer from the robustness degradation effect, in which the defense reduces rather than improves the adversarial robustness of a target model in a white-box setting. A potential cause of this negative effect is that the adversarial training examples are static and independent of the pre-processing model. To solve this problem, we investigate the influence of full adversarial examples, which are crafted against the full model, and find that they indeed have a positive impact on the robustness of defenses. Furthermore, we find that simply changing the adversarial training examples in pre-processing methods does not completely alleviate the robustness degradation effect. This is because the adversarial risk of the pre-processed model is neglected, which is another cause of the robustness degradation effect. Motivated by the above analyses, we propose a method called Joint Adversarial Training based Pre-processing (JATP) defense. Specifically, we formulate a feature similarity based adversarial risk for the pre-processing model by using full adversarial examples found in a feature space. Unlike standard adversarial training, we only update the pre-processing model, which prompts us to introduce a pixel-wise loss to improve its cross-model transferability. We then conduct joint adversarial training on the pre-processing model to minimize this overall risk. Empirical results show that our method effectively mitigates the robustness degradation effect across different target models in comparison to previous state-of-the-art approaches.
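The difference between adversarial examples that are independent of the pre-processing model and full adversarial examples crafted against the full model can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the 2-D linear target model `W`, the linear smoothing pre-processor `A`, the input `x`, the budget `eps`, and the single signed-gradient step are hypothetical stand-ins, not the models or attacks used in the paper.

```python
# Toy illustration only: 2-D linear stand-ins, not the paper's models.
W = [1.0, -0.2]                   # hypothetical target model T(z) = W . z
A = [[0.6, 0.4], [0.4, 0.6]]      # hypothetical pre-processor P(x) = A x

def preprocess(x):
    """Apply the linear pre-processor P."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def target_logit(z):
    """Logit of the true class under the target model T."""
    return sum(w * zi for w, zi in zip(W, z))

def full_logit(x):
    """Logit of the true class under the full model F = T(P(x))."""
    return target_logit(preprocess(x))

def sign(v):
    return (v > 0) - (v < 0)

def signed_step(x, grad, eps):
    """One signed-gradient step that lowers the true-class logit."""
    return [xi - eps * sign(g) for xi, g in zip(x, grad)]

# Input gradients of the logit, without and with P in the loop.
grad_independent = W                                      # dT(x)/dx
grad_full = [sum(A[i][j] * W[i] for i in range(2))        # dT(P(x))/dx = A^T W
             for j in range(2)]

x, eps = [1.0, 1.0], 0.5
x_independent = signed_step(x, grad_independent, eps)  # crafted against T alone
x_full = signed_step(x, grad_full, eps)                # crafted against F

print(full_logit(x), full_logit(x_independent), full_logit(x_full))
```

In this toy setting the example crafted against the full model lowers the true-class logit of the defended pipeline more than the one crafted against the target model alone, which is the sense in which examples that ignore the pre-processor can understate the white-box threat.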

Paper Structure

This paper contains 13 sections, 10 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: Visualization of the robustness degradation effect. We evaluate the white-box robustness (accuracy under white-box adaptive attacks) of three pre-processing defenses, APE-G [shen2017ape], HGD [liao2018defense], and NRP [naseer2020self], on CIFAR-10 [krizhevsky2009learning]. The target models are adversarially trained with two strategies: Standard [madry2017towards] and TRADES [zhang2019theoretically]. We combine the adaptive attack strategy with three attacks, PGD [madry2017towards], AA [croce2020reliable], and FWA [wu2020stronger], to craft adversarial examples. "None" denotes that no pre-processing defense is used, "Obl" denotes a pre-processing model trained using oblivious adversarial examples, and "Full" denotes a pre-processing model trained using full adversarial examples.
  • Figure 2: (a) The distinct influence of oblivious and full adversarial examples on CIFAR-10. (b) A visual illustration of natural examples, adversarial examples, and pre-processed examples. The adversarial examples are crafted by an adaptive PGD attack.
  • Figure 3: A visual illustration of our Joint Adversarial Training based Pre-processing (JATP) defense. We use adversarial examples against the full model $\mathcal{F}$ to train a pre-processing model $\mathcal{P}$ that minimizes a hybrid loss composed of the pixel-wise loss $\mathcal{L}_{\mathbf{pix}}$ and the adversarial loss $\mathcal{L}_{\mathbf{adv}}$.
  • Figure 4: (a) Fooling rate (lower is better) of BPDA and PGD against pre-processing models. The target model is trained by TRADES. (b) Ablation study. We remove the pixel-wise loss ("Pix"), the BCE adversarial loss ("BCE"), and the feature similarity adversarial loss ("FSM") in turn to investigate their impact on our model.
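
The training setup described in the Figure 3 caption can be written schematically as a two-level problem. This is only a sketch under stated assumptions: the target model $\mathcal{T}$, the composition $\mathcal{F} = \mathcal{T} \circ \mathcal{P}$, the generic attack loss $\ell$, the $\ell_\infty$ budget $\epsilon$, and the weight $\alpha$ are notation introduced here for illustration. The exact forms of $\mathcal{L}_{\mathbf{pix}}$ and $\mathcal{L}_{\mathbf{adv}}$ are not given in this section and may differ from what is shown.

```latex
\begin{align}
  % Inner step: craft a full adversarial example against F = T \circ P
  \tilde{x} &= \arg\max_{\|x' - x\|_\infty \le \epsilon}
      \ell\big(\mathcal{F}(x'),\, y\big),
      \qquad \mathcal{F} = \mathcal{T} \circ \mathcal{P}, \\
  % Outer step: update only the pre-processing parameters \theta_P
  \min_{\theta_{\mathcal{P}}} \;&
      \mathcal{L}_{\mathbf{pix}}\big(\mathcal{P}(\tilde{x}),\, x\big)
      + \alpha\, \mathcal{L}_{\mathbf{adv}}\big(\mathcal{T}(\mathcal{P}(\tilde{x})),\, y\big).
\end{align}
```

The point of the sketch is that the adversarial example $\tilde{x}$ depends on the current $\mathcal{P}$, while only $\theta_{\mathcal{P}}$ is updated and the target model stays fixed, as the abstract states.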