A practical approach to evaluating the adversarial distance for machine learning classifiers

Georg Siedel, Ekagra Gupta, Andrey Morozov

TL;DR

This paper investigates the estimation of the more informative adversarial distance using iterative adversarial attacks and a certification approach, and finds that the adversarial attack approach is effective compared to related implementations, while the certification method falls short of expectations.

Abstract

Robustness is critical for machine learning (ML) classifiers to ensure consistent performance in real-world applications where models may encounter corrupted or adversarial inputs. In particular, assessing the robustness of classifiers to adversarial inputs is essential to protect systems from vulnerabilities and thus ensure safety in use. However, methods to accurately compute adversarial robustness have been challenging for complex ML models and high-dimensional data. Furthermore, evaluations typically measure adversarial accuracy on specific attack budgets, limiting the informative value of the resulting metrics. This paper investigates the estimation of the more informative adversarial distance using iterative adversarial attacks and a certification approach. Combined, the methods provide a comprehensive evaluation of adversarial robustness by computing estimates for the upper and lower bounds of the adversarial distance. We present visualisations and ablation studies that provide insights into how this evaluation method should be applied and parameterised. We find that our adversarial attack approach is effective compared to related implementations, while the certification method falls short of expectations. The approach in this paper should encourage a more informative way of evaluating the adversarial robustness of ML classifiers.
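The idea of bounding the adversarial distance from above with an iterative attack can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a hypothetical binary linear classifier, for which the minimal $L_{\infty}$ adversarial distance has the closed form $|w \cdot x + b| / \|w\|_1$, and a simple signed-gradient attack with a fixed step size `eps_step`. The closed form stands in for the exact reference value here; it is not the certification method the paper evaluates. All names (`predict`, `iterative_linf_distance`, the weights and the input) are illustrative assumptions.

```python
import numpy as np

def predict(w, b, x):
    """Hypothetical binary linear classifier: 1 if w.x + b > 0, else 0."""
    return int(np.dot(w, x) + b > 0)

def iterative_linf_distance(w, b, x, eps_step=0.01, max_steps=1000):
    """Take small signed-gradient steps until the predicted label flips.

    Returns the L-inf distance between x and the first adversarial point
    found, which is an upper bound on the minimal adversarial distance.
    Returns None if no adversarial point is found within max_steps.
    """
    label = predict(w, b, x)
    # For a linear model the gradient of the logit is w, so this direction
    # pushes the logit toward the opposite class as fast as possible in L-inf.
    direction = -np.sign(w) if label == 1 else np.sign(w)
    x_adv = x.copy()
    for _ in range(max_steps):
        x_adv = x_adv + eps_step * direction
        if predict(w, b, x_adv) != label:
            return float(np.max(np.abs(x_adv - x)))
    return None

# Illustrative weights and input (assumptions, not data from the paper)
w = np.array([2.0, -1.0, 0.5])
b = 0.25
x = np.array([1.0, 0.5, -0.5])

# Exact minimal L-inf distance for a linear classifier (dual-norm formula)
exact = abs(np.dot(w, x) + b) / np.sum(np.abs(w))

coarse = iterative_linf_distance(w, b, x, eps_step=0.1)
fine = iterative_linf_distance(w, b, x, eps_step=0.01)

print(f"exact={exact:.4f}  eps_step=0.1 -> {coarse:.4f}  eps_step=0.01 -> {fine:.4f}")
```

In this toy setting a smaller `eps_step` gives a tighter upper bound at the cost of more iterations, which is the same kind of trade-off between tightness and computational effort that the paper studies for varying $\epsilon_{step}$.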

Paper Structure

This paper contains 18 sections, 3 equations, 6 figures, 2 tables, 2 algorithms.

Figures (6)

  • Figure 1: A one-step attack and an iterative attack on a model (decision boundary in red) illustrate the difference between (a) finding some adversarial perturbation $x_{adversarial}$ using the full attack budget $\epsilon$ and (b) finding a good estimate of the minimal adversarial distance between $x_{adversarial}$ and $x$. In case (a), only a binary success/failure outcome of the attack (adversarial risk/accuracy) can be reported, while (b) allows a continuous robustness assessment.
  • Figure 2: Working principle of CLEVER score estimation
  • Figure 3: $L_{\infty}$ Distance of 50 images for different $\epsilon_{step}$ values (standard model)
  • Figure 4: Trade-off between tightness of the mean adversarial distance estimation and the computational effort for varying $\epsilon_{step}$
  • Figure 5: Comparison of all attacks, standard model, 20 images. Above: Mean Adversarial Distance and total runtime. Below: Image-wise.
  • ...and 1 more figure