
Getting a-Round Guarantees: Floating-Point Attacks on Certified Robustness

Jiankai Jin, Olga Ohrimenko, Benjamin I. P. Rubinstein

TL;DR

A rounding search method is designed that efficiently exploits floating-point rounding errors to find adversarial examples against state-of-the-art robustness certifications in two threat models, which differ in how the norm of the perturbation is computed.

Abstract

Adversarial examples pose a security risk as they can alter decisions of a machine learning classifier through slight input perturbations. Certified robustness has been proposed as a mitigation where, given an input $\mathbf{x}$, a classifier returns a prediction and a certified radius $R$ with a provable guarantee that any perturbation to $\mathbf{x}$ with $R$-bounded norm will not alter the classifier's prediction. In this work, we show that these guarantees can be invalidated due to limitations of floating-point representation that cause rounding errors. We design a rounding search method that can efficiently exploit this vulnerability to find adversarial examples against state-of-the-art certifications in two threat models that differ in how the norm of the perturbation is computed. We show that the attack can be carried out against linear classifiers that have exact certifiable guarantees and against neural networks that have conservative certifications. In the weak threat model, our experiments demonstrate attack success rates over 50% on random linear classifiers, up to 23% on the MNIST dataset for linear SVM, and up to 15% for a neural network. In the strong threat model, the success rates are lower but positive. The floating-point errors exploited by our attacks can range from small to large (e.g., $10^{-13}$ to $10^{3}$), showing that even negligible errors can be systematically exploited to invalidate guarantees provided by certified robustness. Finally, we propose a formal mitigation approach based on rounded interval arithmetic, encouraging future implementations of robustness certificates to account for limitations of modern computing architecture to provide sound certifiable guarantees.
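As a small illustration of the rounding errors the abstract refers to, the sketch below computes the certified radius of a binary linear classifier twice: once in 64-bit floating point and once with exact rational arithmetic. It assumes the point-to-hyperplane distance $R = |\mathbf{w}^\top\mathbf{x}+b| / \|\mathbf{w}\|_2$ as the exact radius; the function names and sample values are hypothetical and are not taken from the paper.

```python
import math
from fractions import Fraction

def radius_float(w, x, b):
    """Certified radius |w.x + b| / ||w||_2 in IEEE 754 double precision."""
    margin = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(margin) / math.sqrt(sum(wi * wi for wi in w))

def radius_squared_exact(w, x, b):
    """Exact R^2 as a rational number (squaring avoids the irrational sqrt)."""
    w = [Fraction(v) for v in w]  # Fraction(float) is the float's exact value
    x = [Fraction(v) for v in x]
    b = Fraction(b)
    margin = sum(wi * xi for wi, xi in zip(w, x)) + b
    return margin * margin / sum(wi * wi for wi in w)

# Hypothetical model and instance, chosen only for illustration.
w = [0.1, -0.7, 0.3]
x = [0.5, 0.2, -0.9]
b = 0.05

r_tilde = radius_float(w, x, b)            # computed radius (with rounding)
r2_exact = radius_squared_exact(w, x, b)   # exact squared radius

# Comparing squares exactly tells us whether the computed radius is too large.
overestimated = Fraction(r_tilde) ** 2 > r2_exact
print(r_tilde, "overestimates R" if overestimated else "does not overestimate R")
```

Whether a particular input overestimates the radius depends on how the individual rounding steps happen to fall, so the script reports the outcome rather than assuming one.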

Paper Structure

This paper contains 35 sections, 2 theorems, 8 equations, 4 figures, 2 tables, 3 algorithms.

Key Result

Lemma 1

Consider the interval arithmetic operators in Definition \ref{def:arithmetic-operators}. If the resulting lower (upper) interval limits are computed using IEEE 754 floating-point arithmetic with rounding down (up), then the resulting rounded interval arithmetic operators are sound floating-point extensions.
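The idea behind the lemma can be sketched in a few lines. The example below is an approximation under stated assumptions, not the paper's implementation: instead of switching the IEEE 754 rounding mode, it computes each endpoint with the default round-to-nearest and then moves it one representable float outward with `math.nextafter` (Python 3.9+). This yields slightly wider intervals than true directed rounding but still encloses the real-valued result. The `Interval` class and its methods are hypothetical names, and infinities and NaNs are not handled.

```python
import math
from dataclasses import dataclass
from fractions import Fraction

def down(v: float) -> float:
    """Next representable float below v (conservative 'round down')."""
    return math.nextafter(v, -math.inf)

def up(v: float) -> float:
    """Next representable float above v (conservative 'round up')."""
    return math.nextafter(v, math.inf)

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(down(self.lo + other.lo), up(self.hi + other.hi))

    def __sub__(self, other):
        return Interval(down(self.lo - other.hi), up(self.hi - other.lo))

    def __mul__(self, other):
        # The extreme products are always among the four endpoint products.
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(down(min(p)), up(max(p)))

    def contains(self, value: Fraction) -> bool:
        """Exact membership test against a rational number."""
        return Fraction(self.lo) <= value <= Fraction(self.hi)

a = Interval(0.1, 0.1)
b = Interval(0.2, 0.2)
s = a + b

# Exact results for the floats actually stored in a and b.
exact_sum = Fraction(0.1) + Fraction(0.2)
exact_prod = Fraction(0.1) * Fraction(0.2)
print(s, s.contains(exact_sum), (a * b).contains(exact_prod))
```

A certified radius computed with such intervals would report the lower endpoint, so that rounding can only make the certificate more conservative.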

Figures (4)

  • Figure 1: The search direction $\boldsymbol{\nu}$ (blue line) and search area (green area) for finding adversarial examples against a model, whose decision boundary is the orange line. $\mathbf{x}$ is the original instance, $\tilde{R}$ and $R$ are the computed and real-valued certified radii of the model on $\mathbf{x}$, $\boldsymbol{\delta}=\tilde{R}\boldsymbol{\nu} / \|\boldsymbol{\nu}\|$ is the adversarial perturbation in the search direction $\boldsymbol{\nu}$, instance $\mathbf{x'}=\mathbf{x}+\boldsymbol{\delta}$ is the seed for the green search area. Our rounding search method will sample $N$ floating-point neighbors $\boldsymbol{\delta'}$ of $\boldsymbol{\delta}$, and evaluate each $\mathbf{x}+\boldsymbol{\delta'}$ to check if any one of them can flip the classification of the model with $\|\boldsymbol{\delta'}\| \le \tilde{R}$ or $\overline{\|\boldsymbol{\delta}'\|}\le\tilde{R}$ (the red points in the green search area).
  • Figure 2: (a) Rounding search attack success rates against a random binary linear classifier in both weak (W) and strong (S) threat models (Section \ref{sec:exp:linear}). For each dimension $D$, we report the percentage of 10 000 randomly initialized models for which we can successfully find an adversarial example within certified radius $\tilde{R}$ for a random instance $\mathbf{x}$ drawn from $[-1,1]^D$. Since the attacks are against an exact certified radius, the baseline attack rate should be $0\%$ in both weak and strong threat models. Model weights $\mathbf{w}$ and biases $b$ are randomly initialized with $\mathbf{w}\in[-1,1]^D$, $b\in[-1,1]$. All values are stored and all computations are performed using either 32-bit or 64-bit floating-point numbers. (b) Maximum rounding error in the calculation of the certified radius $\tilde{R}$ on a sample $\mathbf{x}$ with each $x_i=3.3\times10^{9}$, for the linear model with $w_i=3.3\times10^{-9}$, $b=3.3\times10^9$, where $i \in [1,D]$ and $D\in[20,1000]$.
  • Figure 3: (a) Original images from the MNIST dataset. (b) Corresponding adversarial images in the weak threat model, whose perturbations lie within the exact certified radius (i.e., $\|\boldsymbol{\delta}\| \le \tilde{R}$) yet which the linear SVM model misclassifies. For example, the top-left image in (a) has certified radius of $\tilde{R}=333.608776491892\emph{5}$, and is classified as 1, while the corresponding adversarial image in (b) has perturbation $\|\boldsymbol{\delta}\|=333.608776491892\emph{4}$, and is classified as 0. Labels at the bottom right of each image are the classifications of the linear SVM model (Section \ref{sec:lrmodel}).
  • Figure 4: Exploration of the flattening of success rates observed in Section \ref{sec:exp:linear}. The experiment is done in the weak threat model with 64-bit floating-point representation. $\tilde{R}$ is the estimated certified radius using IEEE 754 floating-point arithmetic (with rounding errors), $\hat{R}$ is the certified radius found via binary search within $[\underline{R}, \overline{R}]$, and $\underline{R}$ and $\overline{R}$ are the lower and upper bounds of the real certified radius $R$ estimated using the interval arithmetic library PyInterval. $\hat{R}$ is our best approximation to $R$, within which there are no robustness violations. (a) Deviations ($\hat{R}-\tilde{R}$) between $\tilde{R}$ and $\hat{R}$ over 10 000 trials for $D\in\{3,30\}$. (b) For each dimension $D\in[1,100]$, we plot the percentage of 10 000 trials for which $\hat{R}>\tilde{R}$ and $\hat{R}\le\tilde{R}$. Our attacks may work when $\hat{R}<\tilde{R}$, that is, when the certified radius is overestimated.
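The rounding search described in the Figure 1 caption can be sketched for a binary linear classifier as follows. This is a minimal illustration, not the authors' algorithm: the neighbor-sampling scheme (moving each coordinate of $\boldsymbol{\delta}$ by a few representable floats), the parameters `n_samples` and `max_steps`, and all function names are assumptions. The check uses the floating-point norm of $\boldsymbol{\delta'}$; the caption's second condition on $\overline{\|\boldsymbol{\delta}'\|}$ is not modeled.

```python
import math
import random

def margin(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    return 1 if margin(w, b, x) >= 0 else -1

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def certified_radius(w, b, x):
    """Computed radius R~ = |w.x + b| / ||w||, including rounding errors."""
    return abs(margin(w, b, x)) / norm(w)

def nudge(v, steps):
    """Move v by |steps| representable floats, up if steps > 0, else down."""
    target = math.inf if steps > 0 else -math.inf
    for _ in range(abs(steps)):
        v = math.nextafter(v, target)
    return v

def rounding_search(w, b, x, n_samples=1000, max_steps=2, seed=0):
    rng = random.Random(seed)
    label = predict(w, b, x)
    r_tilde = certified_radius(w, b, x)
    # Search direction nu: along the boundary normal, towards the boundary.
    nu = [-label * wi for wi in w]
    scale = r_tilde / norm(nu)
    delta = [scale * ni for ni in nu]  # seed perturbation of length ~R~
    for _ in range(n_samples):
        # Sample a floating-point neighbor delta' of delta.
        cand = [nudge(di, rng.randint(-max_steps, max_steps)) for di in delta]
        x_adv = [xi + ci for xi, ci in zip(x, cand)]
        if norm(cand) <= r_tilde and predict(w, b, x_adv) != label:
            return cand, x_adv  # violation of the computed certificate
    return None

# Random model and instance in [-1, 1]^D, in the spirit of the Figure 2 setup.
rng = random.Random(1)
D = 30
w = [rng.uniform(-1, 1) for _ in range(D)]
x = [rng.uniform(-1, 1) for _ in range(D)]
b = rng.uniform(-1, 1)

result = rounding_search(w, b, x)
print("adversarial example found" if result else "no violation found in this run")
```

A single run may or may not find a violation; the success rates reported in the captions are aggregated over many random models.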

Theorems & Definitions (7)

  • Definition 1
  • Remark 1
  • Definition 2
  • Lemma 1
  • Theorem 1
  • proof
  • Remark 2