Provably Minimally-Distorted Adversarial Examples

Nicholas Carlini, Guy Katz, Clark Barrett, David L. Dill

TL;DR

This work uses formal verification (Reluplex) to construct provably minimally distorted adversarial examples, enabling precise assessment of attack effectiveness and defense robustness on a small MNIST network. It shows that iterative attacks like CW approach the true minimum distortion (within about 6–12%), and that adversarial training can dramatically increase the distortion required for successful adversarial examples (roughly 4x). The authors acknowledge scalability constraints of current verifiers and advocate designing verification-friendly architectures to extend provable guarantees to larger systems. Overall, the paper demonstrates a concrete, verification-grounded framework for evaluating and improving adversarial robustness beyond empirical testing alone.

Abstract

The ability to deploy neural networks in real-world, safety-critical systems is severely limited by the presence of adversarial examples: slightly perturbed inputs that are misclassified by the network. In recent years, several techniques have been proposed for increasing robustness to adversarial examples --- and yet most of these have been quickly shown to be vulnerable to future attacks. For example, over half of the defenses proposed by papers accepted at ICLR 2018 have already been broken. We propose to address this difficulty through formal verification techniques. We show how to construct provably minimally distorted adversarial examples: given an arbitrary neural network and input sample, we can construct adversarial examples which we prove are of minimal distortion. Using this approach, we demonstrate that one of the recent ICLR defense proposals, adversarial retraining, provably succeeds at increasing the distortion required to construct adversarial examples by a factor of 4.2.
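One way to picture the search for a minimally distorted adversarial example is as a sequence of yes/no queries to a verifier: "does an adversarial example exist within distortion eps?" The sketch below is only an illustration of that idea, not the authors' procedure. It assumes the answer is monotone in eps and uses plain bisection. The names `minimal_distortion` and `linear_oracle` are hypothetical, and a toy linear classifier with a closed-form answer under the L-infinity norm stands in for a real verifier such as Reluplex.

```python
from typing import Callable, Sequence


def minimal_distortion(oracle: Callable[[float], bool],
                       upper: float,
                       tol: float = 1e-6) -> float:
    """Bisect for the smallest eps at which oracle(eps) is True.

    Assumes the oracle is monotone (True for every eps above the minimum)
    and that `upper` is a known-feasible distortion, e.g. one found by a
    heuristic attack.
    """
    if not oracle(upper):
        raise ValueError("upper bound must admit an adversarial example")
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if oracle(mid):
            hi = mid   # an adversarial example exists within mid: tighten
        else:
            lo = mid   # proven robust within mid: the minimum is larger
    return hi


def linear_oracle(w: Sequence[float], b: float,
                  x: Sequence[float]) -> Callable[[float], bool]:
    """Toy stand-in for a verifier (hypothetical, not Reluplex).

    For the linear classifier sign(w.x + b), an L-infinity perturbation of
    size eps can move the score by at most eps * ||w||_1, so the decision
    boundary is reachable exactly when eps * ||w||_1 >= |w.x + b|.
    """
    margin = abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
    l1_norm = sum(abs(wi) for wi in w)
    return lambda eps: eps * l1_norm >= margin


if __name__ == "__main__":
    oracle = linear_oracle([1.0, -2.0], 0.5, [1.0, 0.25])
    print(minimal_distortion(oracle, upper=1.0))
```

For this toy classifier the score at `x` is 1.0 and the L1 norm of `w` is 3, so the search should settle near 1/3. With a real verifier each oracle call is a full verification query, which is where the scalability limits mentioned in the TL;DR come from.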

Paper Structure

This paper contains 7 sections, 3 equations, 2 figures, 2 tables.
