Adversarial Robustness Overestimation and Instability in TRADES
Jonathan Weiping Li, Ren-Wei Liang, Cheng-Han Yeh, Cheng-Chang Tsai, Kuanchun Yu, Chun-Shien Lu, Shang-Tse Chen
TL;DR
This work tackles probabilistic robustness overestimation in TRADES, showing that PGD-10 validation accuracy can substantially exceed AutoAttack test accuracy because of gradient masking in multiclass settings. By examining inner maximization, batch-level gradient information, and loss landscapes, the authors identify beta, batch size, and learning rate as key drivers of instability, particularly on more complex datasets. They observe a self-healing phenomenon, in which some unstable training runs recover from overestimation, and propose a real-time remedy that monitors the First-Order Stationary Condition (FOSC) and injects Gaussian noise when needed, mitigating overestimation without retraining. Experiments show that Adversarial Weight Perturbation (AWP) does not resolve this issue, underscoring the need for robust evaluation strategies and reliable mitigation techniques for TRADES and related defenses.
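To make the role of beta concrete, here is a minimal NumPy sketch of the TRADES training objective: the natural cross-entropy loss plus beta times the KL divergence between the model's predictions on natural and adversarial inputs. It assumes the logits for both inputs have already been computed, with the adversarial inputs produced by the inner maximization (e.g. PGD). The function names are illustrative, not taken from the authors' code.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def trades_loss(logits_nat, logits_adv, labels, beta):
    """Batch-mean TRADES objective: CE(f(x), y) + beta * KL(f(x) || f(x_adv)).

    logits_nat, logits_adv: arrays of shape (batch, num_classes).
    labels: integer class labels of shape (batch,).
    beta: weight of the robust (KL) regularization term.
    """
    logp_nat = log_softmax(logits_nat)
    logp_adv = log_softmax(logits_adv)
    n = logits_nat.shape[0]
    # Natural cross-entropy term on clean inputs.
    ce = -logp_nat[np.arange(n), labels].mean()
    # Robust term: KL divergence from the natural to the adversarial prediction.
    p_nat = np.exp(logp_nat)
    kl = (p_nat * (logp_nat - logp_adv)).sum(axis=1).mean()
    return ce + beta * kl
```

A smaller beta shrinks the weight on the robust term relative to the natural loss, which is one of the settings the authors associate with a higher likelihood of robustness overestimation.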
Abstract
This paper examines the phenomenon of probabilistic robustness overestimation in TRADES, a prominent adversarial training method. Our study reveals that TRADES sometimes yields disproportionately high PGD validation accuracy compared with AutoAttack test accuracy in multiclass classification tasks. This discrepancy indicates a significant overestimation of robustness in these instances, potentially linked to gradient masking. We further analyze the factors that make training unstable and thereby lead to overestimation. Our findings indicate that smaller batch sizes, lower beta values (which control the weight of the robust loss term in TRADES), larger learning rates, and higher class complexity (e.g., CIFAR-100 versus CIFAR-10) are associated with an increased likelihood of robustness overestimation. By examining metrics such as the First-Order Stationary Condition (FOSC), inner maximization, and gradient information, we identify gradient masking as the underlying cause of this phenomenon and provide insights into it. Furthermore, our experiments show that certain unstable training instances may return to a state without robustness overestimation, which inspired our proposed solution. In addition to adjusting hyperparameter settings to reduce instability, or retraining when overestimation occurs, we recommend adding Gaussian noise to the inputs when the FOSC score exceeds a threshold. This method aims to mitigate robustness overestimation in TRADES and similar methods at its source, ensuring a more reliable representation of adversarial robustness during evaluation.
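The recommended remedy can be sketched as follows. This is a hypothetical NumPy illustration, not the authors' implementation. It uses the common closed form of FOSC for an ℓ∞ ball of radius ε around the natural input x⁰, namely c(xᵏ) = ε‖g‖₁ − ⟨xᵏ − x⁰, g⟩, where g is the gradient of the inner-maximization loss at the adversarial input xᵏ; a value of 0 means xᵏ is a first-order stationary point. The batch-mean aggregation, the threshold, the noise scale sigma, the clipping to [0, 1], and the function names are all assumptions chosen for illustration.

```python
import numpy as np

def fosc(x_adv, x_nat, grad, eps):
    """Per-example FOSC for an L-infinity ball of radius eps around x_nat.

    c(x_adv) = eps * ||g||_1 - <x_adv - x_nat, g>, where g is the gradient of the
    inner-maximization loss w.r.t. x_adv. Values are >= 0 for x_adv inside the
    ball, and 0 means x_adv satisfies the first-order stationarity condition.
    """
    n = x_adv.shape[0]
    g = grad.reshape(n, -1)
    delta = (x_adv - x_nat).reshape(n, -1)
    return eps * np.abs(g).sum(axis=1) - (delta * g).sum(axis=1)

def maybe_add_gaussian_noise(x, fosc_scores, threshold, sigma, rng):
    """Hypothetical remedy: add N(0, sigma^2) noise to the inputs when the
    batch-mean FOSC exceeds `threshold`; otherwise return x unchanged.

    Returns the (possibly perturbed) batch and a flag saying whether noise was added.
    """
    if fosc_scores.mean() <= threshold:
        return x, False
    noisy = x + rng.normal(0.0, sigma, size=x.shape)
    # Assumption: inputs are images normalized to [0, 1], so clip back into range.
    return np.clip(noisy, 0.0, 1.0), True
```

In a training loop, one would compute the FOSC scores for each PGD batch and call, for example, `maybe_add_gaussian_noise(x, scores, threshold=0.1, sigma=0.05, rng=np.random.default_rng(0))`; the threshold and sigma shown here are placeholders, not values reported in the paper.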
