Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks

Stephen Casper, Kaivalya Hariharan, Dylan Hadfield-Menell

TL;DR

The paper presents SNAFUE, an automated method for diagnosing DNN weaknesses by finding natural adversarial features through embeddings of synthetic and natural patches. It demonstrates scalable, automated red-teaming of an ImageNet classifier with copy/paste attacks, reproducing prior results and uncovering hundreds of additional vulnerabilities with interpretable features. The approach combines latent-space adversarial patches, cosine-similarity screening, and selection of natural patches to reveal human-describable failure modes, offering a practical tool for scalable AI oversight. The work highlights both the potential and the limitations of automated interpretability, and points toward safer deployment and possible extensions to NLP.
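
The cosine-similarity screening step mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `top_k_similar_patches`, the array shapes, and the choice to score each natural patch by its mean similarity to the adversarial patches are all assumptions made for illustration. It also assumes every latent vector is nonzero.

```python
import numpy as np


def top_k_similar_patches(adv_latents, nat_latents, k=10):
    """Rank natural patches by cosine similarity to adversarial patches.

    adv_latents: (n_adv, d) latents of synthetic adversarial patches.
    nat_latents: (n_nat, d) latents of candidate natural patches.
    Returns the indices of the k best natural patches and their scores.
    """
    adv = np.asarray(adv_latents, dtype=float)
    nat = np.asarray(nat_latents, dtype=float)

    # Normalize each row so a dot product equals cosine similarity.
    adv = adv / np.linalg.norm(adv, axis=1, keepdims=True)
    nat = nat / np.linalg.norm(nat, axis=1, keepdims=True)

    sims = nat @ adv.T  # shape (n_nat, n_adv)

    # Assumption: aggregate by the mean similarity over adversarial patches.
    scores = sims.mean(axis=1)

    order = np.argsort(-scores)[:k]  # highest score first
    return order, scores[order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    adv = rng.normal(size=(10, 64))    # hypothetical adversarial latents
    nat = rng.normal(size=(1000, 64))  # hypothetical natural-patch latents
    idx, scores = top_k_similar_patches(adv, nat, k=10)
    print(idx, scores)
```

In practice the latents would come from the target network's activations rather than from random draws, and a different aggregation (for example, the maximum similarity) could be swapped in.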

Abstract

This paper considers the problem of helping humans exercise scalable oversight over deep neural networks (DNNs). Adversarial examples can be useful by helping to reveal weaknesses in DNNs, but they can be difficult to interpret or draw actionable conclusions from. Some previous works have proposed using human-interpretable adversarial attacks including copy/paste attacks in which one natural image pasted into another causes an unexpected misclassification. We build on these with two contributions. First, we introduce Search for Natural Adversarial Features Using Embeddings (SNAFUE) which offers a fully automated method for finding copy/paste attacks. Second, we use SNAFUE to red team an ImageNet classifier. We reproduce copy/paste attacks from previous works and find hundreds of other easily-describable vulnerabilities, all without a human in the loop. Code is available at https://github.com/thestephencasper/snafue

Paper Structure (11 sections, 2 equations, 9 figures)

Figures (9)

  • Figure 1: SNAFUE, our automated method for finding targeted copy/paste attacks. This example illustrates an experiment which found that cats can cause photocopiers to be misclassified as printers. (a) First, we create feature-level adversarial patches as in casper2022robust by perturbing the latent activations of a generator. (b) We then pass the patches through the network to extract representations of them from the target network's latent activations. Finally, we select the natural patches whose latents are the most similar to the adversarial ones.
  • Figure 2: Examples of targeted natural adversarial patches for image classification identified using SNAFUE. They reveal consistent, easily-describable failure modes that can be used to interpret the network (e.g. "envelopes plus cats are misclassified by the network as cartons"). Each row contains 10 patches labeled with the attack source and target. When a patch is inserted into any source class image, it tends to cause misclassification as the target class. See Figure \ref{fig:breadth} for quantitative evaluation and Figure \ref{fig:nat_examples2} in the Appendix for additional examples.
  • Figure 3: Our automated replications of all 9 prior examples of ImageNet copy/paste attacks of which we are aware, from carter2019activation, hernandez2022natural, and casper2022robust. Each set of images is labeled source class $\to$ target class. Each row of 10 patches is labeled with its mean success rate.
  • Figure 4: (Top) Examples of copy/paste attacks between similar source/target classes. Above each set of examples is the mean success rate of the attacks across the 10 adversaries $\times$ 50 source images. (Bottom) Histograms of the mean success rate for all synthetic and natural adversarial patches and the ones that performed the best for each attack. Labels for the adversarial features (e.g. "white fur") are human-produced.
  • Figure 5: (Top) Examples from our most successful copy/paste attack using a clothing source and a traffic target. The mean success rate of the attacks across 10 adversaries $\times$ 50 source images is shown above each example. (Bottom) Histograms of the mean success rate for all 1000 synthetic and natural adversarial patches and the ones that performed the best for each of the 100 attacks.
  • ...and 4 more figures
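
The mean success rates reported in the captions above (for example, across 10 adversaries $\times$ 50 source images) can be computed along the following lines. This is a sketch, not the authors' evaluation code: it assumes an attack counts as a success when the patched image is classified as the target class, and the function names and the `(n_patches, n_source_images)` layout of predictions are hypothetical.

```python
import numpy as np


def per_patch_success_rate(patched_preds, target_class):
    """Fraction of source images that each patch flips to the target class.

    patched_preds: (n_patches, n_source_images) array of predicted class
    indices after each patch is pasted into each source image.
    """
    preds = np.asarray(patched_preds)
    # Assumption: success means the prediction equals the target class.
    return (preds == target_class).mean(axis=1)


def mean_success_rate(patched_preds, target_class):
    """Mean success rate over all patch and source-image pairs."""
    preds = np.asarray(patched_preds)
    return float((preds == target_class).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical predictions: 10 patches x 50 source images, 1000 classes.
    preds = rng.integers(0, 1000, size=(10, 50))
    print(per_patch_success_rate(preds, target_class=3))
    print(mean_success_rate(preds, target_class=3))
```

Reporting the per-patch rates alongside the overall mean makes it possible to tell whether one strong patch or a consistently effective feature drives the result.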