Do Counterfactual Examples Complicate Adversarial Training?
Eric Yeats, Cameron Darwin, Eduardo Ortega, Frank Liu, Hai Li
TL;DR
This work probes the robustness-accuracy tradeoff in adversarial training by generating low-norm counterfactual examples (CEs) with a pretrained diffusion model. It shows that robust models' confidence and accuracy on clean training data are associated with how close the data are to their CEs, and that these models perform poorly on the CEs themselves because they become invariant to the low-norm semantic changes that CEs introduce. The results indicate a significant overlap between non-robust and semantically meaningful features, challenging the assumption that non-robust features are uninterpretable and suggesting that robust training approaches must account for invariance to semantic features. Overall, the diffusion-based CE approach offers a new lens on robustness and motivates alternative training strategies that mitigate the tradeoff.
Abstract
We leverage diffusion models to study the robustness-performance tradeoff of robust classifiers. Our approach introduces a simple method, based on a pretrained diffusion model, to generate low-norm counterfactual examples (CEs): semantically altered data whose true class membership differs from that of the original data. We report that the confidence and accuracy of robust models on their clean training data are associated with the proximity of the data to their CEs. Moreover, robust models perform very poorly when evaluated on the CEs directly, as they become increasingly invariant to the low-norm semantic changes that CEs introduce. The results indicate a significant overlap between non-robust and semantic features, countering the common assumption that non-robust features are not interpretable.
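
To make the proximity analysis concrete, the following minimal sketch shows one way such an association could be measured: compute the L2 norm of the perturbation between each sample and its CE, then correlate those distances with the model's confidence on the clean samples. This is an illustration, not the authors' implementation. The helper names, the choice of the L2 norm and Pearson correlation, and the toy numbers are all assumptions made here, and the toy values are not results from the paper.

```python
import math

def l2_distance(x, x_ce):
    """Euclidean norm of the perturbation between a sample and its CE."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_ce)))

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

# Made-up toy values for illustration only (hypothetical, not from the paper):
# flattened clean samples, their counterfactual examples, and the classifier's
# confidence in the true class on each clean sample.
samples = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
counterfactuals = [[0.1, 0.0], [1.0, 1.5], [2.0, 1.0]]
confidences = [0.55, 0.80, 0.95]

# Distance from each clean sample to its CE (the "proximity" being studied).
distances = [l2_distance(x, c) for x, c in zip(samples, counterfactuals)]

# Strength and sign of the linear association between proximity and confidence.
r = pearson(distances, confidences)
print(f"CE distances: {distances}")
print(f"correlation with confidence: {r:.3f}")
```

In practice the confidences would come from a robust classifier evaluated on its clean training data, and a rank-based statistic could replace the Pearson coefficient if the relationship is not expected to be linear.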
