Do Counterfactual Examples Complicate Adversarial Training?

Eric Yeats, Cameron Darwin, Eduardo Ortega, Frank Liu, Hai Li

TL;DR

This work probes the robustness-accuracy tradeoff in adversarial training by generating low-norm counterfactuals (CEs) with a diffusion-based method. It shows that robust models' confidence and accuracy on clean training data are tied to how close the data are to CEs, and that these models perform poorly on CEs themselves, indicating invariance to subtle semantic changes. The study reveals a notable overlap between non-robust and semantically meaningful features, challenging the assumption that non-robust features are uninterpretable and suggesting that robustness approaches must account for semantic feature invariance. Overall, the diffusion CE approach provides a new lens to understand robustness and motivates alternative training strategies to mitigate the tradeoffs.

Abstract

We leverage diffusion models to study the robustness-performance tradeoff of robust classifiers. Our approach introduces a simple, pretrained diffusion method to generate low-norm counterfactual examples (CEs): semantically altered data which results in different true class membership. We report that the confidence and accuracy of robust models on their clean training data are associated with the proximity of the data to their CEs. Moreover, robust models perform very poorly when evaluated on the CEs directly, as they become increasingly invariant to the low-norm, semantic changes brought by CEs. The results indicate a significant overlap between non-robust and semantic features, countering the common assumption that non-robust features are not interpretable.
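
The reported association between confidence and CE proximity can be illustrated with a minimal sketch. It is not the authors' code. It assumes flattened inputs, L2 distance, and several CEs per clean sample that have already been generated elsewhere. The function names `mean_ce_distance`, `true_class_confidence`, and `rank_correlation` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def mean_ce_distance(x, ces):
    """Average L2 distance from each clean sample to its own CEs.

    x:   (n, d) array of flattened clean samples.
    ces: (n, k, d) array holding k counterfactuals per sample (assumed given).
    Returns an (n,) array.
    """
    diffs = ces - x[:, None, :]
    return np.linalg.norm(diffs, axis=-1).mean(axis=1)

def true_class_confidence(logits, labels):
    """Softmax probability that the classifier assigns to the true label."""
    z = logits - logits.max(axis=1, keepdims=True)  # stabilise the exponent
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return p[np.arange(len(labels)), labels]

def rank_correlation(a, b):
    """Spearman-style rank correlation (ties are not averaged)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ra, rb)[0, 1]
```

Under these assumptions, a positive `rank_correlation(mean_ce_distance(x, ces), true_class_confidence(logits, labels))` for a robust model would be consistent with the claim that samples with closer CEs receive lower confidence. Rank correlation is used here only because it does not presume a linear relationship; the paper itself presents the relationship as scatter plots.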

Paper Structure

This paper contains 14 sections, 12 equations, 15 figures, and 2 tables.

Figures (15)

  • Figure 1: Conceptual depiction of the relationship between accuracy of robustly trained models with proximity to counterfactual examples (CEs). Stronger adversarial training inevitably leads to misclassification of some clean training data, incurring downstream test performance loss. We hypothesize that adversarially trained models are forced to become invariant to some semantic features due to the nearby presence of true CEs.
  • Figure 2: CE distribution comparison. Boltzmann variant CEs produce lower-norm, sparser changes. Best viewed in color.
  • Figure 3: Scatter plots of classifier confidence and average CE distance of 10000 clean training samples as adversarial training norm is increased. Robust models are more likely to misclassify and lose confidence on data which have closer CEs. Best viewed in color.
  • Figure 4: Classifier performance on 10000 training data and 200000 CE data generated from the training samples. Best viewed in color.
  • Figure 5: Distance of different-class CEs generated by the Boltzmann method ($w=15$, $\sigma_{CE}=0.2$) when the input data is the original CIFAR10 train data, CEs generated by a robust $\varepsilon=1$ model from the CIFAR10 train data, and CEs generated by a robust $\varepsilon=2$ model from the CIFAR10 train data. Robust model CEs tend to be in data regions farther away from our diffusion-generated CEs. Best viewed in color.
  • ...and 10 more figures