Subspace Defense: Discarding Adversarial Perturbations by Learning a Subspace for Clean Signals

Rui Zheng; Yuhao Zhou; Zhiheng Xi; Tao Gui; Qi Zhang; Xuanjing Huang

TL;DR

This paper empirically shows that the features of clean signals and of adversarial perturbations are each redundant and lie in separate low-dimensional linear subspaces with minimal overlap, and that classical low-dimensional subspace projection can suppress the perturbation features that fall outside the subspace of clean signals.

Abstract

Deep neural networks (DNNs) are notoriously vulnerable to adversarial attacks, which place carefully crafted perturbations on normal examples to fool the model. To better understand such attacks, a characterization of the features carried by adversarial examples is needed. In this paper, we tackle this challenge by inspecting the subspaces of sample features through spectral analysis. We first empirically show that the features of clean signals and of adversarial perturbations are each redundant and lie in separate low-dimensional linear subspaces with minimal overlap, and that classical low-dimensional subspace projection can suppress the perturbation features that fall outside the subspace of clean signals. This makes it possible for DNNs to learn a subspace in which only the features of clean signals exist while those of perturbations are discarded, which can facilitate the distinction of adversarial examples. To counter the residual perturbations that are inevitable in subspace learning, we propose an independence criterion to disentangle clean signals from perturbations. Experimental results show that the proposed strategy enables the model to inherently suppress adversaries, which not only boosts model robustness but also motivates new directions for effective adversarial defense.
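The projection idea in the abstract can be illustrated with a small numerical sketch. It is not the paper's implementation: the synthetic data, the use of an SVD to estimate the clean-signal subspace, and the names `clean_subspace_projector`, `clean_basis`, and `perturb` are all assumptions made for illustration. The sketch builds clean features that lie exactly in a $p$-dimensional subspace, adds a perturbation confined to the orthogonal complement, and shows that projecting onto the estimated clean subspace discards the perturbation.

```python
import numpy as np


def clean_subspace_projector(clean_features, p):
    """Return a (d, d) orthogonal projector onto the top-p right singular
    directions of an (n, d) clean feature matrix (hypothetical helper)."""
    # Rows of vt are orthonormal directions in feature space,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(clean_features, full_matrices=False)
    basis = vt[:p].T          # (d, p) orthonormal basis of the subspace
    return basis @ basis.T    # (d, d) projector


rng = np.random.default_rng(0)
n, d, p = 200, 16, 2

# Assumption: clean features lie exactly in a p-dimensional subspace.
clean_basis = np.linalg.qr(rng.normal(size=(d, p)))[0]   # (d, p)
clean = rng.normal(size=(n, p)) @ clean_basis.T          # (n, d)

# Assumption: the perturbation lives entirely outside that subspace
# (the idealized "no overlap" case; real features overlap slightly).
noise = rng.normal(size=(n, d))
perturb = noise - noise @ clean_basis @ clean_basis.T

adversarial = clean + perturb
P = clean_subspace_projector(clean, p)
projected = adversarial @ P

print("perturbation norm before projection:", np.linalg.norm(adversarial - clean))
print("perturbation norm after projection: ", np.linalg.norm(projected - clean))
```

In this idealized setting the projection recovers the clean features up to floating-point error. With real model features the two subspaces overlap slightly, so some perturbation survives the projection, which is the residual that the abstract's independence criterion is meant to address.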

Paper Structure

This paper contains 25 sections, 10 equations, 4 figures, 4 tables.

Introduction
Related Work
Textual Adversarial Attack
Textual Adversarial Defense
Spectral Analysis in Feature Space
Threat Model
Spectral Analysis
Subspace Projection
Proposed Method
Subspace Learning Module
Hilbert-Schmidt Independence Criterion
Model Training
Experiments
Datasets
Baselines
...and 10 more sections

Figures (4)

Figure 1: (a) Spectral analysis of the features of clean signals, adversarial perturbations, and adversarial examples on SST-2. (b) and (c) show, respectively, the accuracy $(\%)$ and the robustness evaluation (accuracy under TextFooler attack) after projecting the perturbed features onto the $p$-dimensional clean-signal subspace.
Figure 2: Averaged feature magnitudes of clean signals, adversarial examples, and their corresponding projected counterparts. The low-dimensional $(p=2)$ clean subspace projector acts as a noise filter, eliminating the high feature magnitudes introduced by adversarial perturbations.
Figure 3: Accuracy and robustness evaluation (accuracy under TextFooler attack) of the model under different subspace dimensions. Both peak when the subspace dimension is between 5 and 10.
Figure 4: Robustness at each epoch throughout training on SST-2 and AGNews with different training strategies. Compared to adversarial training methods such as PGD and FreeLB, the proposed subspace defense speeds up robust training and converges much faster in terms of accuracy under attack.
