Wasserstein distributional adversarial training for deep neural networks
Xingjian Bai, Guangyi He, Yifan Jiang, Jan Obloj
TL;DR
The paper tackles adversarial robustness under distributional threats by extending TRADES to Wasserstein distributionally robust optimization (W-DRO), and derives a first-order, sensitivity-based approximation that makes training practical. It introduces a W-DRO reformulation and a W-PGD-budget attack strategy, along with a fine-tuning protocol (randomize the last layer or apply small perturbations) that upgrades pre-trained models without sacrificing their existing pointwise robustness. Experiments on multiple RobustBench networks trained on CIFAR-10 show consistent improvements in Wasserstein distributional robustness, with gains that vary with the scale of the data used in pre-training; some gains persist even when fine-tuning uses only the original training data. The work provides a scalable, relatively inexpensive way to strengthen distributional adversarial defenses and offers guidance for applying W-DRO fine-tuning to existing models.
Abstract
The design of adversarial attacks for deep neural networks, as well as methods of adversarial training against them, are the subject of intense research. In this paper, we propose methods to train against distributional attack threats, extending the TRADES method used for pointwise attacks. Our approach leverages recent contributions and relies on sensitivity analysis for Wasserstein distributionally robust optimization problems. We introduce an efficient fine-tuning method which can be deployed on a previously trained model. We test our methods on a range of pre-trained models from RobustBench. These experimental results demonstrate that the additional training enhances Wasserstein distributional robustness while maintaining original levels of pointwise robustness, even for already very successful networks. The improvements are less marked for models pre-trained using huge synthetic datasets of 20-100M images. However, remarkably, our methods are sometimes still able to improve their performance even when trained using only the original training dataset (50k images).
