Logit Pairing Methods Can Fool Gradient-Based Attacks

Marius Mosbach, Maksym Andriushchenko, Thomas Trost, Matthias Hein, Dietrich Klakow

TL;DR

The paper challenges the robustness claims of logit pairing defenses (CLP, LSQ, ALP) by showing that they mainly distort the input-space loss surface and obfuscate gradients rather than provide true robustness. Through extensive PGD-based evaluations across MNIST, CIFAR-10, and Tiny ImageNet, it demonstrates that results are highly sensitive to attack parameters and restarts, with CLP/LSQ failing to offer real protection and ALP giving only modest gains when combined with adversarial training. The authors advocate exhaustive grid searches over PGD parameters and many restarts to avoid false conclusions, and conclude that ALP's improvements are not substantially better than adversarial training alone. Overall, the work cautions against relying on default PGD settings to assess robustness and emphasizes dataset-dependent outcomes for logit pairing methods.

Abstract

Recently, Kannan et al. [2018] proposed several logit regularization methods to improve the adversarial robustness of classifiers. We show that the computationally fast methods they propose - Clean Logit Pairing (CLP) and Logit Squeezing (LSQ) - just make the gradient-based optimization problem of crafting adversarial examples harder without providing actual robustness. We find that Adversarial Logit Pairing (ALP) may indeed provide robustness against adversarial examples, especially when combined with adversarial training, and we examine it in a variety of settings. However, the increase in adversarial accuracy is much smaller than previously claimed. Finally, our results suggest that the evaluation against an iterative PGD attack relies heavily on the parameters used and may result in false conclusions regarding robustness of a model.

Paper Structure

This paper contains 10 sections, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Input loss surfaces of MNIST models in a random subspace around an input image with $\epsilon=38.25$. We can clearly see a distorted loss surface for the logit regularization methods, which can prevent gradient-based attacks from succeeding. Additional visualizations are found in Figures \ref{fig:mnist_clp_loss_surfaces_appendix}, \ref{fig:mnist_alp_loss_surfaces_appendix}, \ref{fig:cifar10_clp_loss_surfaces_appendix}, and \ref{fig:cifar10_alp_loss_surfaces_appendix} in the Appendix.
  • Figure 2: Heatmaps of the adversarial accuracy for 100% adversarial training (Madry et al. [2017]), LSQ, and CLP models trained on MNIST, for different settings of the step size $\epsilon_i$ and the number of iterations $n$ when running the PGD attack with $\epsilon = 76.5$. Heatmaps for other models can be found in Figures \ref{fig:grid_search_at} and \ref{fig:grid_search_alp} in the Appendix. For all heatmaps, the adversarial accuracy was evaluated on 1000 points drawn randomly from the test data.
  • Figure 3: Histograms of the loss values for a single point over $10000$ random restarts of the PGD attack on a CLP model trained on MNIST. We show 4 typical cases, which illustrate that there exist points for which the loss can be successfully maximized only with a good starting point. The vertical red line marks the loss value $-\ln(0.1)$; for this and higher values of the loss, an adversarial example is guaranteed to be found. More histograms can be found in Figure \ref{fig:histograms_appendix} in the Appendix.
  • Figure 4: Input loss surfaces of the CLP model on MNIST in a random subspace with $\epsilon=38.25$ for the first eight test examples. The loss surface contains many local maxima, which makes gradient-based attacks much more difficult. This is in line with our quantitative results in Table \ref{table:mnist}, which show that this model does not provide actual robustness and that the gradient-based PGD attack must use many random restarts to successfully craft adversarial examples.
  • Figure 5: Input loss surfaces of the Plain + ALP model on MNIST in a random subspace with $\epsilon=38.25$ for the first eight test examples. The loss surface has a local maximum at the input point. At the same time, our quantitative results in Table \ref{table:mnist} show that this model is resistant even to our strongest attack, which uses many random restarts of PGD.
  • ...and 5 more figures
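
The evaluation protocol that the captions for Figures 2 and 3 refer to (a grid over PGD step size and iteration count, plus many random restarts) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the linear softmax classifier stands in for a trained network, inputs are assumed to lie in [0, 1] rather than on the pixel scale the captions appear to use, and all function names (`loss_and_grad`, `pgd_linf`, `grid_search`, `misclassification_guaranteed`) are hypothetical.

```python
import numpy as np


def loss_and_grad(W, b, x, label):
    """Cross-entropy loss and its gradient w.r.t. x for a linear softmax model."""
    logits = W @ x + b
    z = logits - logits.max()                 # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    probs = np.exp(log_probs)
    onehot = np.zeros_like(probs)
    onehot[label] = 1.0
    return -log_probs[label], W.T @ (probs - onehot)


def pgd_linf(W, b, x, label, eps, step, n_iter, n_restarts, rng):
    """L-infinity PGD; returns the highest loss over all restarts and its input."""
    best_loss, best_x = -np.inf, x.copy()
    for _ in range(n_restarts):
        # Each restart begins at a random point inside the eps-ball.
        x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
        for _ in range(n_iter):
            _, grad = loss_and_grad(W, b, x_adv, label)
            x_adv = x_adv + step * np.sign(grad)      # signed gradient ascent step
            x_adv = np.clip(x_adv, x - eps, x + eps)  # project back onto the eps-ball
            x_adv = np.clip(x_adv, 0.0, 1.0)          # stay in the valid input range
        loss, _ = loss_and_grad(W, b, x_adv, label)
        if loss > best_loss:
            best_loss, best_x = loss, x_adv.copy()
    return best_loss, best_x


def grid_search(W, b, x, label, eps, steps, iters, n_restarts, seed=0):
    """Best loss found for every (step size, iteration count) pair."""
    results = {}
    for step in steps:
        for n_iter in iters:
            rng = np.random.default_rng(seed)  # same random starts for every cell
            results[(step, n_iter)] = pgd_linf(
                W, b, x, label, eps, step, n_iter, n_restarts, rng)[0]
    return results


def misclassification_guaranteed(loss, n_classes=10):
    """True if the cross-entropy loss exceeds -ln(1 / n_classes)."""
    return loss > -np.log(1.0 / n_classes)
```

Because `pgd_linf` keeps the maximum over restarts, adding restarts with the same random stream can only raise the reported loss, which is why a single restart can overstate robustness on a loss surface with many local maxima. The threshold helper reflects the red line in Figure 3: with 10 classes, a loss above $-\ln(0.1)$ means the true class has probability below 0.1, so at least one other class must have higher probability.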