Adversarial Examples Are Not Real Features

Ang Li; Yifei Wang; Yiwen Guo; Yisen Wang

TL;DR

This work reexamines the robust vs. non-robust feature framework for adversarial examples by evaluating four learning paradigms, supervised learning (SL), contrastive learning (CL), masked image modeling (MIM), and diffusion models (DM), and by formalizing cross-paradigm usefulness and robustness. For a set of paradigms $\mathcal{T}$, it defines $CU_{\mathcal{T}}(g,\mathcal{D}) = \min_{T \in \mathcal{T}} U(g,\mathcal{D},T)$ and $CR_{\mathcal{T}}(g,\mathcal{D}) = \min_{T \in \mathcal{T}} R(g,\mathcal{D},T)$, i.e., the worst-case usefulness and robustness over all paradigms in $\mathcal{T}$, together with relative metrics $RU$ and $RR$ for comparing across paradigms. Empirically, robust features transfer well across paradigms and resemble natural features in usefulness, while non-robust features are largely ineffective cross-paradigm and behave as paradigm-wise shortcuts; adversarial examples also show limited cross-paradigm transfer. The findings indicate that robustness obtained from a single robust dataset does not guarantee universal robustness, suggesting that combining multiple learning paradigms during adversarial training may be necessary for true robustness.
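As a minimal sketch of the two definitions above: cross-paradigm usefulness and robustness are simply the minimum of the per-paradigm scores. The function name `cross_paradigm_score` and all numbers below are hypothetical placeholders chosen for illustration, not values or code from the paper; how the per-paradigm scores $U$ and $R$ are measured is left outside this sketch.

```python
from typing import Mapping


def cross_paradigm_score(per_paradigm: Mapping[str, float]) -> float:
    """Return the worst-case (minimum) score over a set of paradigms.

    `per_paradigm` maps a paradigm name T to a per-paradigm score,
    e.g. U(g, D, T) for usefulness or R(g, D, T) for robustness.
    """
    if not per_paradigm:
        raise ValueError("at least one paradigm is required")
    return min(per_paradigm.values())


# Hypothetical per-paradigm usefulness scores of one feature (illustrative only).
usefulness = {"SL": 0.90, "CL": 0.40, "MIM": 0.35, "DM": 0.30}
# Hypothetical per-paradigm robustness scores of the same feature.
robustness = {"SL": 0.20, "CL": 0.10, "MIM": 0.15, "DM": 0.05}

cu = cross_paradigm_score(usefulness)  # CU: limited by the weakest paradigm
cr = cross_paradigm_score(robustness)  # CR: limited by the weakest paradigm
print(f"CU = {cu:.2f}, CR = {cr:.2f}")
```

Because the minimum is taken, a feature that scores highly under SL alone still receives a low cross-paradigm score if any other paradigm in the set cannot exploit it.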

Abstract

The existence of adversarial examples has been a mystery for years and has attracted much interest. A well-known theory by Ilyas et al. (2019) explains adversarial vulnerability from a data perspective by showing that one can extract non-robust features from adversarial examples and that these features alone are useful for classification. However, the explanation remains quite counter-intuitive, since non-robust features mostly look like noise to humans. In this paper, we re-examine the theory in a larger context by incorporating multiple learning paradigms. Notably, we find that, contrary to their good usefulness under supervised learning, non-robust features attain poor usefulness when transferred to other self-supervised learning paradigms, such as contrastive learning, masked image modeling, and diffusion models. This reveals that non-robust features are not really as useful as robust or natural features, which enjoy good transferability between these paradigms. Meanwhile, for robustness, we also show that naturally trained encoders from robust features are largely non-robust under AutoAttack. Our cross-paradigm examination suggests that non-robust features are not really useful but are more like paradigm-wise shortcuts, and that robust features alone might be insufficient to attain reliable model robustness. Code is available at https://github.com/PKU-ML/AdvNotRealFeatures.

Paper Structure (24 sections, 14 equations, 3 figures, 10 tables)

Introduction
A Cross-Paradigm View of Robust and Non-robust Features
Background
Cross-Paradigm Notions of Robust and Non-Robust Features
Cross-Paradigm Usefulness of Robust and Non-robust Features
Setup
Non-robust Features are Not Cross-paradigmly Useful
Cross-Paradigm Robustness of Robust Features
Setup
Robust Features are Not Robust both In-Paradigm and Cross-Paradigm
Cross-Paradigm Transferability of Adversarial Attacks
Cross-Paradigm Transferability of Attack Objectives
Cross-Paradigm Transferability of Backbone Encoders
Relationship to Natural Transferability between Paradigms
Conclusion
...and 9 more sections

Figures (3)

Figure 1: Tiny-ImageNet instances containing natural, robust, and non-robust features, respectively. The robust and non-robust instances are generated from random noise following the iterative optimization procedure of Ilyas et al. (2019). The robust features are semantically aligned with natural images, while the non-robust features are always noise-like.
Figure 2: Loss value vs. attack iteration steps when using different attack objectives, CE loss (blue lines) or InfoNCE loss (orange lines), and backbones from different paradigms, trained by SL (first and second subfigures) or CL (third and fourth subfigures).
Figure 3: The cross-paradigm robustness of adversarial examples generated with encoders from different learning paradigms. The $(A,B)$-th cell represents the accuracy of adversarial examples generated with an $A$-paradigm model (encoder with linear head) when evaluated on a $B$-paradigm model (encoder with linear head). Darker colors (i.e., higher accuracy) indicate worse transferability of adversarial examples.
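The $(A,B)$ grid described in the Figure 3 caption can be sketched as follows, assuming the predictions of each $B$-paradigm model on adversarial examples crafted with each $A$-paradigm model have already been collected. The function name `transfer_accuracy_matrix`, the nested-dictionary layout, and the toy labels and predictions are hypothetical and only illustrate the bookkeeping; they are not taken from the paper's code.

```python
from typing import Dict, Sequence


def transfer_accuracy_matrix(
    preds: Dict[str, Dict[str, Sequence[int]]],
    labels: Sequence[int],
) -> Dict[str, Dict[str, float]]:
    """Build the (A, B) accuracy grid.

    preds[A][B] holds the labels predicted by the B-paradigm model on
    adversarial examples generated with the A-paradigm model.
    Higher accuracy in a cell means the attack transfers worse from A to B.
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    matrix: Dict[str, Dict[str, float]] = {}
    for source, by_target in preds.items():
        matrix[source] = {}
        for target, predicted in by_target.items():
            if len(predicted) != len(labels):
                raise ValueError("prediction/label length mismatch")
            correct = sum(int(p == y) for p, y in zip(predicted, labels))
            matrix[source][target] = correct / len(labels)
    return matrix


# Toy data (illustrative only): four examples, two paradigms.
labels = [0, 1, 1, 0]
preds = {
    "SL": {"SL": [1, 0, 0, 1], "CL": [0, 1, 1, 1]},
    "CL": {"SL": [0, 1, 0, 0], "CL": [1, 1, 0, 1]},
}
grid = transfer_accuracy_matrix(preds, labels)
for source, row in grid.items():
    print(source, row)
```

In this toy grid the diagonal cells (attack and evaluation in the same paradigm) have lower accuracy than the off-diagonal cells, which is the pattern one would read as limited cross-paradigm transfer.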
