ZOO: Zeroth Order Optimization based Black-box Attacks to Deep Neural Networks without Training Substitute Models

Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, Cho-Jui Hsieh

TL;DR

The paper tackles the challenge of crafting strong adversarial examples against deep neural networks in a black-box setting without training substitute models. It introduces ZOO, a zeroth-order optimization framework that estimates gradients from input-output queries alone and combines coordinate descent with attack-space dimension reduction, hierarchical attack, and importance sampling. ZOO achieves performance comparable to the white-box Carlini & Wagner attack on MNIST and CIFAR-10, significantly outperforms substitute-model black-box methods, and scales to ImageNet. The approach enables practical, model-agnostic black-box attacks and points to future work on accelerating the attack and on adversarial training as a defense.
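
Estimating gradients "from input-output queries" rests on finite differences. As a worked equation, the standard symmetric difference quotients below estimate the first and second partial derivatives of an attack objective f along coordinate i, where e_i is the i-th standard basis vector and h is a small step size (notation assumed here for illustration):

```latex
\hat{g}_i := \frac{\partial f(\mathbf{x})}{\partial x_i}
  \approx \frac{f(\mathbf{x} + h\mathbf{e}_i) - f(\mathbf{x} - h\mathbf{e}_i)}{2h},
\qquad
\hat{h}_i := \frac{\partial^2 f(\mathbf{x})}{\partial x_i^2}
  \approx \frac{f(\mathbf{x} + h\mathbf{e}_i) - 2f(\mathbf{x}) + f(\mathbf{x} - h\mathbf{e}_i)}{h^2}
```

Each gradient coordinate costs two queries, plus one more for f(x) if the curvature estimate is wanted, so estimating the full gradient scales with the input dimension. That cost is what motivates coordinate-wise updates, a reduced attack-space, and importance sampling.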

Abstract

Deep neural networks (DNNs) are one of the most prominent technologies of our time, as they achieve state-of-the-art performance in many machine learning tasks, including but not limited to image classification, text mining, and speech processing. However, recent research on DNNs has raised ever-increasing concern about their robustness to adversarial examples, especially for security-critical tasks such as traffic sign identification for autonomous driving. Studies have unveiled the vulnerability of a well-trained DNN by demonstrating the ability to generate barely noticeable (to both humans and machines) adversarial images that lead to misclassification. Furthermore, researchers have shown that these adversarial images are highly transferable by simply training and attacking a substitute model built upon the target model, known as a black-box attack to DNNs. Similar to the setting of training substitute models, in this paper we propose an effective black-box attack that likewise has access only to the input (images) and the output (confidence scores) of a targeted DNN. However, rather than leveraging attack transferability from substitute models, we propose zeroth order optimization (ZOO) based attacks to directly estimate the gradients of the targeted DNN for generating adversarial examples. We use zeroth order stochastic coordinate descent along with dimension reduction, hierarchical attack and importance sampling techniques to efficiently attack black-box models. By exploiting zeroth order optimization, improved attacks on the targeted DNN can be accomplished, sparing the need for training substitute models and avoiding the loss in attack transferability. Experimental results on MNIST, CIFAR10 and ImageNet show that the proposed ZOO attack is as effective as the state-of-the-art white-box attack and significantly outperforms existing black-box attacks via substitute models.
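
The abstract names the core loop, zeroth order stochastic coordinate descent, without showing it. The following is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the black box is a hypothetical toy linear softmax classifier (`make_toy_model`), the objective is squared L2 distortion plus `c` times a targeted hinge loss on log-probabilities (in the spirit of the Carlini & Wagner formulation the paper compares against), a coordinate is picked uniformly at random, its partial derivative is estimated by a symmetric difference quotient, and the update is a per-coordinate ADAM step. The box constraint is handled by plain clipping to [-0.5, 0.5], a simplification. Function names and default hyperparameters are illustrative. Dimension reduction, hierarchical attack and importance sampling are omitted.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def make_toy_model(weights):
    """Hypothetical stand-in for the target DNN: a linear softmax classifier.
    The attack only ever calls the returned function (scores out, no gradients)."""
    def query(x):
        logits = [sum(w * xj for w, xj in zip(row, x)) for row in weights]
        return softmax(logits)
    return query


def targeted_loss(query, x, target, kappa):
    """Hinge loss on log-probabilities; it bottoms out at -kappa once the target
    class beats every other class by a log-probability margin of kappa."""
    log_p = [math.log(max(p, 1e-30)) for p in query(x)]
    best_other = max(v for i, v in enumerate(log_p) if i != target)
    return max(best_other - log_p[target], -kappa)


def zoo_adam_attack(query, x0, target, c=1.0, kappa=0.0, h=1e-4, lr=0.01,
                    beta1=0.9, beta2=0.999, eps=1e-8, iters=2000, seed=0,
                    lo=-0.5, hi=0.5):
    rng = random.Random(seed)
    n = len(x0)
    x = list(x0)
    m, v, t = [0.0] * n, [0.0] * n, [0] * n  # per-coordinate ADAM state

    def objective(z):
        # squared L2 distortion + c * attack loss, evaluated via queries only
        dist = sum((a - b) ** 2 for a, b in zip(z, x0))
        return dist + c * targeted_loss(query, z, target, kappa)

    for _ in range(iters):
        i = rng.randrange(n)  # stochastic coordinate selection (uniform here)
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        # symmetric difference quotient: two model queries per iteration
        g = (objective(x_plus) - objective(x_minus)) / (2 * h)
        t[i] += 1
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        m_hat = m[i] / (1 - beta1 ** t[i])  # bias-corrected moments
        v_hat = v[i] / (1 - beta2 ** t[i])
        # ADAM step on coordinate i, then clip to the valid pixel range
        x[i] = min(hi, max(lo, x[i] - lr * m_hat / (math.sqrt(v_hat) + eps)))
    return x
```

On the toy model `make_toy_model([[1.0, 0.0], [0.0, 1.0]])`, calling `zoo_adam_attack(query, [0.3, 0.0], target=1, c=2.0, kappa=0.5)` pushes the input across the decision boundary toward class 1 using only score queries. A positive `kappa` matters here: ADAM oscillates near the boundary, and the margin keeps the returned iterate on the target side.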

Paper Structure

This paper contains 19 sections, 8 equations, 7 figures, 3 tables, 3 algorithms.

Figures (7)

  • Figure 1: Visual illustration of our proposed black-box attack (ZOO) to sampled images from ImageNet. The columns from left to right are original images with correct labels, additive adversarial noises from our attack, and crafted adversarial images with misclassified labels.
  • Figure 2: Taxonomy of adversarial attacks to deep neural networks (DNNs). "Back propagation" means an attacker can access the internal configurations in DNNs (e.g., performing gradient descent), and "Query" means an attacker can input any sample and observe the corresponding output.
  • Figure 3: Attacking the bagel image in Figure 1 (a) with importance sampling. Top: Pixel values in certain parts of the bagel image change significantly in the RGB channels, and the changes in the R channel are more prominent than in the other channels. Here the attack-space is $32 \times 32 \times 3$. Although our targeted attack in this attack-space fails, its adversarial noise provides important clues to pixel importance. We use the noise from this attack-space to sample important pixels after increasing the attack-space to a larger dimension. Bottom: Importance sampling probability distribution for the $64 \times 64 \times 3$ attack-space. The importance is computed by taking the absolute value of pixel value changes, running a $4 \times 4$ max-pooling for each channel, up-sampling to the dimension of $64 \times 64 \times 3$, and normalizing all values.
  • Figure 4: Visual comparison of successful adversarial examples in MNIST. Each row displays crafted adversarial examples from the sampled images in (a). Each column in (b) to (d) indexes the targeted class for attack (digits 0 to 9).
  • Figure 5: Visual comparison of successful adversarial examples in CIFAR10. Each row displays crafted adversarial examples from the sampled images in (a). Each column in (b) to (d) indexes the targeted class for attack.
  • ...and 2 more figures
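
The Figure 3 caption spells out how the importance map is built. A minimal pure-Python sketch of that recipe follows. Several details are assumptions, not stated in the caption: the max-pooling keeps the resolution (each 4×4 block of |noise| is overwritten by its maximum), the up-sampling is nearest-neighbour by a factor of 2, an all-zero noise falls back to a uniform map, and coordinates are then drawn with probability proportional to the map via `random.choices`. Images are nested `[row][col][channel]` lists, and `importance_map` and `sample_coordinates` are hypothetical helper names.

```python
import random


def block_max_pool(channel, k):
    """Replace each k x k block of a 2-D list by its maximum (resolution kept)."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for bi in range(0, h, k):
        for bj in range(0, w, k):
            rows = range(bi, min(bi + k, h))
            cols = range(bj, min(bj + k, w))
            block_max = max(channel[i][j] for i in rows for j in cols)
            for i in rows:
                for j in cols:
                    out[i][j] = block_max
    return out


def upsample_nearest(channel, factor):
    """Nearest-neighbour up-sampling of a 2-D list by an integer factor."""
    return [[row[j // factor] for j in range(len(row) * factor)]
            for row in channel for _ in range(factor)]


def importance_map(noise, pool=4, factor=2):
    """noise: H x W x C perturbation found in the smaller attack-space.
    Returns an (H*factor) x (W*factor) x C probability map summing to 1."""
    h, w, c = len(noise), len(noise[0]), len(noise[0][0])
    per_channel = []
    for ch in range(c):
        mag = [[abs(noise[i][j][ch]) for j in range(w)] for i in range(h)]
        per_channel.append(upsample_nearest(block_max_pool(mag, pool), factor))
    big_h, big_w = h * factor, w * factor
    total = sum(per_channel[ch][i][j]
                for ch in range(c) for i in range(big_h) for j in range(big_w))
    if total == 0:  # no signal yet: fall back to uniform sampling
        u = 1.0 / (big_h * big_w * c)
        return [[[u] * c for _ in range(big_w)] for _ in range(big_h)]
    return [[[per_channel[ch][i][j] / total for ch in range(c)]
             for j in range(big_w)] for i in range(big_h)]


def sample_coordinates(prob, k, rng):
    """Draw k (row, col, channel) coordinates with probability given by prob."""
    coords = [(i, j, ch) for i in range(len(prob))
              for j in range(len(prob[0])) for ch in range(len(prob[0][0]))]
    weights = [prob[i][j][ch] for i, j, ch in coords]
    return rng.choices(coords, weights=weights, k=k)
```

In the caption's setting, `noise` would be the $32 \times 32 \times 3$ perturbation from the failed lower-dimensional attack and the result a $64 \times 64 \times 3$ distribution for the enlarged attack-space; the defaults `pool=4` and `factor=2` match that case. Zero-weight coordinates are never drawn, so the coordinate-descent budget concentrates on regions where the earlier noise was large.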