
Adversarial Robustness Overestimation and Instability in TRADES

Jonathan Weiping Li, Ren-Wei Liang, Cheng-Han Yeh, Cheng-Chang Tsai, Kuanchun Yu, Chun-Shien Lu, Shang-Tse Chen

TL;DR

This work tackles probabilistic robustness overestimation in TRADES, showing that PGD-10 validation accuracy can substantially exceed AutoAttack test accuracy because of gradient masking in multiclass settings. By examining the inner maximization, batch-level gradient information, and loss landscapes, the authors identify beta, batch size, and learning rate as key drivers of instability, particularly on more complex datasets. They also observe a self-healing phenomenon and propose a real-time remedy that monitors the First-Order Stationary Condition (FOSC) and injects Gaussian noise into the inputs when needed, mitigating overestimation without retraining. Experiments show that Adversarial Weight Perturbation is ineffective against this issue, underscoring the need for robust evaluation strategies and reliable mitigation techniques for TRADES and related defenses.
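The remedy described above is driven by the FOSC score. The following is a minimal sketch that assumes the FOSC definition commonly used for an $L_\infty$ ball of radius $\epsilon$ (introduced by Wang et al., 2019): $c(x^k) = \epsilon \lVert \nabla_x \ell(x^k) \rVert_1 - \langle x^k - x^0, \nabla_x \ell(x^k) \rangle$. Here $x^0$ is the clean input and $x^k$ is the current adversarial iterate, and a value near zero means the inner maximization has converged. The gate `maybe_add_gaussian_noise`, its batch-mean threshold rule, the noise scale `sigma`, and the clipping to $[0, 1]$ are hypothetical illustrations, not the authors' implementation. Input gradients are assumed to be precomputed.

```python
import numpy as np


def fosc(x_adv, x_clean, grad, eps):
    """Per-example FOSC for an L-inf ball of radius eps around x_clean.

    c(x^k) = eps * ||g||_1 - <x^k - x^0, g>, where g is the loss gradient
    w.r.t. the input at x^k. It is >= 0 inside the ball and equals 0 at the
    first-order optimum of the inner maximization.
    """
    n = x_adv.shape[0]
    g = grad.reshape(n, -1)
    delta = (x_adv - x_clean).reshape(n, -1)
    return eps * np.abs(g).sum(axis=1) - (delta * g).sum(axis=1)


def maybe_add_gaussian_noise(x, fosc_scores, threshold, sigma, rng):
    """Hypothetical gate: perturb inputs with N(0, sigma^2) noise when the
    batch-mean FOSC exceeds `threshold`; assumes inputs live in [0, 1]."""
    if np.mean(fosc_scores) > threshold:
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        return np.clip(noisy, 0.0, 1.0)
    return x


rng = np.random.default_rng(0)
x0 = np.full((2, 3), 0.5)                        # two clean inputs
g = np.array([[1.0, -2.0, 3.0], [0.5, 0.5, -1.0]])
x_converged = x0 + 0.1 * np.sign(g)              # optimal L-inf vertex
print(fosc(x_converged, x0, g, eps=0.1))         # both entries close to 0
print(fosc(x0, x0, g, eps=0.1))                  # eps * ||g||_1 per example
```

A score near zero means the adversarial example sits at the best first-order point in the ball. Large scores signal a poorly converged inner maximization, which is the situation in which the gate perturbs the inputs.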

Abstract

This paper examines the phenomenon of probabilistic robustness overestimation in TRADES, a prominent adversarial training method. Our study reveals that TRADES sometimes yields disproportionately high PGD validation accuracy compared with AutoAttack test accuracy in multiclass classification tasks. This discrepancy indicates a significant overestimation of robustness in these instances, potentially linked to gradient masking. We further analyze the parameters that contribute to unstable models and thus to overestimation. Our findings indicate that smaller batch sizes, lower beta values (beta controls the weight of the robust loss term in TRADES), larger learning rates, and higher class complexity (e.g., CIFAR-100 versus CIFAR-10) are associated with an increased likelihood of robustness overestimation. By examining metrics such as the First-Order Stationary Condition (FOSC), the inner maximization, and gradient information, we identify gradient masking as the underlying cause of this phenomenon and provide insights into its mechanism. Furthermore, our experiments show that some unstable training instances can return to a state without robustness overestimation, which inspired our proposed solution. In addition to adjusting parameter settings to reduce instability, or retraining when overestimation occurs, we recommend adding Gaussian noise to the inputs when the FOSC score exceeds a threshold. This method aims to mitigate robustness overestimation in TRADES and similar methods at its source, yielding a more reliable representation of adversarial robustness during evaluation.
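Several of the findings depend on beta, so it helps to see where beta enters the objective. The sketch below assumes the standard TRADES formulation (Zhang et al., 2019): cross-entropy on clean inputs plus beta times the KL divergence between the clean and adversarial predictive distributions, $\mathcal{L} = \mathrm{CE}(f(x), y) + \beta \cdot \mathrm{KL}(f(x) \,\|\, f(x'))$. The inner maximization that produces $x'$ is omitted and the logits are taken as given. The function names are illustrative.

```python
import numpy as np


def log_softmax(z):
    """Numerically stable row-wise log-softmax."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))


def trades_loss(logits_clean, logits_adv, labels, beta):
    """Natural cross-entropy plus beta * KL(p_clean || p_adv), batch-averaged."""
    logp_clean = log_softmax(logits_clean)
    logp_adv = log_softmax(logits_adv)
    natural = -logp_clean[np.arange(len(labels)), labels].mean()
    p_clean = np.exp(logp_clean)
    robust = (p_clean * (logp_clean - logp_adv)).sum(axis=1).mean()
    return natural + beta * robust


clean = np.array([[2.0, 0.5, -1.0], [0.1, 0.3, 0.2]])
adv = np.array([[0.5, 1.5, -0.5], [0.4, -0.2, 0.6]])
y = np.array([0, 1])
for beta in (1.0, 6.0):  # a larger beta weights the robust (KL) term more heavily
    print(beta, trades_loss(clean, adv, y, beta))
```

When the clean and adversarial logits coincide, the KL term vanishes. Beta therefore only changes the loss once the adversary shifts the model's prediction.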

Paper Structure

This paper contains 23 sections, 4 equations, 7 figures, 7 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Comparison of the data loss landscapes of a regular case and an unstable case under the same configuration. Following the setting of wu2024annealing, engstrom2018evaluating, and chen2021robust, we plot the loss landscape $z = \text{loss}(i + x \cdot r_1 + y \cdot r_2)$, where $i$ is the input data, $r_1 = \text{sign}(\nabla_i f(i))$, and $r_2 \sim \text{Rademacher}(0.5)$. The $x$ and $y$ axes represent the magnitude of the perturbation added along each direction, and the $z$ axis represents the loss. The loss landscape of the unstable case is highly rugged, which is not expected for a robust model.
  • Figure 2: Comparison of FOSC, SGCS, and PGD-10 validation accuracy for two runs with the same configuration but different seeds: the regular case is shown in (a) and the unstable case in (b). For clearer visualization, FOSC is scaled up by a factor of 10. Although the PGD-10 validation accuracy fluctuates only slightly, both FOSC and SGCS change significantly within the same epoch, indicating that gradient masking can indeed be observed through these metrics.
  • Figure 3: Relationship between FOSC and the gap between clean training accuracy and adversarial training accuracy. Adversarial training accuracy is measured using TPGD. This demonstrates that TPGD, as the training adversary, may cause the model to overfit to gradient-based attacks, making it difficult for the adversarial examples to converge to a good condition.
  • Figure 4: Batch-level gradient metrics (defined in detail in the Appendix) of the unstable case at epoch 102, the epoch at which FOSC starts to rise sharply in Figure 3. With a batch size of 256, each epoch comprises 50,000 / 256 ≈ 195 update steps, so epoch 102 corresponds to steps 19,891 to 20,085. The plot shows a correlation among W_Grad_Norm, KL_Norm, and grad_cosine_similarity.
  • Figure 5: FOSC, clean training accuracy (train_acc), and weight gradient norm (W_grad_norm) of the self-healing case, which uses the same configuration as the training instance in Figure 3 but a different seed. For clearer visualization, FOSC is scaled up by a factor of 10 and W_grad_norm is scaled by a factor of 0.3. Note that clean training accuracy declines and FOSC drops to nearly zero at epoch 142; the weight gradient norm then decreases at epoch 143.
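Two computations in the figure captions can be sketched under stated assumptions. The first mirrors the Figure 1 construction on a toy linear softmax classifier, which stands in for the paper's networks and is not their model. It evaluates the loss at $i + x \cdot r_1 + y \cdot r_2$ over a grid, taking $f$ to be the cross-entropy loss so that $r_1 = \text{sign}(\nabla_i f(i))$, with $r_2$ a random $\pm 1$ (Rademacher) vector. The second reproduces the Figure 4 step arithmetic. It assumes $\lfloor 50{,}000 / 256 \rfloor = 195$ steps per epoch, 0-indexed epochs, and 1-indexed steps, which is the convention under which epoch 102 maps to steps 19,891 to 20,085. All function names are hypothetical.

```python
import numpy as np


def ce_and_input_grad(W, x, y):
    """Cross-entropy of a linear softmax model and its gradient w.r.t. input x."""
    logits = x @ W
    logits = logits - logits.max()          # stability; does not change softmax
    p = np.exp(logits) / np.exp(logits).sum()
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    return -np.log(p[y]), W @ (p - onehot)


def loss_landscape(W, x, y, span, steps, rng):
    """Loss on the grid x + a * r1 + b * r2 for a, b in [-span, span]."""
    _, g = ce_and_input_grad(W, x, y)
    r1 = np.sign(g)                               # gradient-sign direction
    r2 = rng.choice([-1.0, 1.0], size=x.shape)    # Rademacher(0.5) direction
    coords = np.linspace(-span, span, steps)
    Z = np.empty((steps, steps))
    for row, a in enumerate(coords):
        for col, b in enumerate(coords):
            Z[row, col], _ = ce_and_input_grad(W, x + a * r1 + b * r2, y)
    return coords, Z


def epoch_step_range(epoch, steps_per_epoch):
    """First and last step of an epoch (0-indexed epochs, 1-indexed steps)."""
    start = epoch * steps_per_epoch + 1
    return start, start + steps_per_epoch - 1


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 10))                      # toy model: 8 features, 10 classes
x = rng.normal(size=8)
coords, Z = loss_landscape(W, x, y=3, span=0.5, steps=11, rng=rng)
print(Z.shape)                                    # (11, 11)
print(epoch_step_range(102, 50_000 // 256))       # (19891, 20085)
```

The toy loss is convex in the input, so its landscape is smooth. The rugged surface reported for the unstable TRADES case is exactly what such a smooth baseline would not show.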