Adversarial Examples that Fool both Computer Vision and Time-Limited Humans

Gamaleldin F. Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alex Kurakin, Ian Goodfellow, Jascha Sohl-Dickstein

TL;DR

The paper investigates whether adversarial examples that fool CNNs also bias time-limited human perception. It combines black-box transfer attacks with a retina-inspired preprocessing layer and a psychophysics setup to test humans under brief viewing conditions. Results show that perturbations transferring across CNN ensembles bias human judgments and increase error rates, revealing a shared illusion between artificial and biological vision. These findings have implications for ML security and neuroscience, pointing to future work that leverages brain-like processing to improve robustness and to understand human perception under adversarial conditions.

Abstract

Machine learning models are vulnerable to adversarial examples: small changes to images can cause computer vision models to make mistakes such as identifying a school bus as an ostrich. However, it is still an open question whether humans are prone to similar mistakes. Here, we address this question by leveraging recent techniques that transfer adversarial examples from computer vision models with known parameters and architecture to other models with unknown parameters and architecture, and by matching the initial processing of the human visual system. We find that adversarial examples that strongly transfer across computer vision models influence the classifications made by time-limited human observers.
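
The transfer idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the logistic "models", the L-infinity budget `eps`, the step size, and the function name `ensemble_targeted_attack` are all assumptions chosen for illustration. It only shows the general pattern of nudging an input toward a target class using the gradient averaged over several models with known parameters, while keeping the perturbation small.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ensemble_targeted_attack(x, models, eps=0.1, step=0.02, n_steps=20):
    """Toy iterative signed-gradient attack toward class 1 on an ensemble.

    x      : list of floats (the clean input).
    models : list of (weights, bias) pairs, each a hypothetical logistic
             classifier with p(target | x) = sigmoid(w . x + b).
    Returns a perturbed copy of x with max |x_adv - x| <= eps.
    """
    x_adv = list(x)
    for _ in range(n_steps):
        grad = [0.0] * len(x)
        for w, b in models:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)) + b)
            # Gradient of log p(target | x) for a logistic model is (1 - p) * w;
            # average it over the ensemble.
            for i, wi in enumerate(w):
                grad[i] += (1.0 - p) * wi / len(models)
        for i, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            stepped = x_adv[i] + step * sign
            # Project back into the eps-ball around the clean input.
            x_adv[i] = min(max(stepped, x[i] - eps), x[i] + eps)
    return x_adv


def mean_target_prob(x, models):
    """Average target-class probability across the ensemble."""
    probs = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for w, b in models]
    return sum(probs) / len(probs)
```

For example, with two hypothetical models `[([1.0, -2.0], 0.0), ([0.5, -1.0], 0.1)]` and a clean input `[0.0, 0.0]`, the perturbed input stays within `eps` of the original while the ensemble's mean target probability increases.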

Paper Structure

This paper contains 33 sections, 7 equations, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: Adversarial examples optimized on more models / viewpoints sometimes appear more meaningful to humans. This observation is a clue that machine-to-human transfer may be possible. (a) A canonical example of an adversarial image, reproduced from goodfellow2014explaining. This adversarial attack has moderate but limited ability to fool the model after geometric transformations or to fool models other than the model used to generate the image. (b) An adversarial attack causing a cat image to be labeled as a computer while being robust to geometric transformations, adapted from athalye2017blog. Unlike the attack in (a), the image contains features that seem semantically computer-like to humans. (c) An adversarial patch that causes images to be labeled as a toaster, optimized to cause misclassification from multiple viewpoints, reproduced from brown2017adversarial. Similar to (b), the patch contains features that appear toaster-like to a human.
  • Figure 2: Experiment setup and task. (a) Example images from the conditions (image, adv, and flip). Top: adv targeting the broccoli class. Bottom: adv targeting the cat class. See the definition of conditions in Section \ref{sec: experiment conditions}. (b) Example images from the false experiment condition. (c) Experiment setup and recording apparatus. (d) Task structure and timings. The subject is asked to repeatedly identify which of two classes (e.g. dog vs. cat) a briefly presented image belongs to. The image is either adversarial or belongs to one of several control conditions. See Section \ref{sec: human experiment} for details.
  • Figure 3: Adversarial images transfer to humans. (a) By adding adversarial perturbations to an image, we are able to bias which of two incorrect choices subjects make. Plot shows the probability of choosing the adversarially targeted class when the true image class is not one of the choices that subjects can report (false condition), estimated by averaging the responses of all subjects (two-tailed t-test relative to chance level $0.5$). (b) Adversarial images cause more mistakes than either clean images or images with the adversarial perturbation flipped vertically before being applied. Plot shows the probability of choosing the true image class, when this class is one of the choices that subjects can report, averaged across all subjects. Accuracy is significantly less than 1 even for clean images due to the brief image presentation time. (error bars $\pm$ SE; *: $p<0.05$; **: $p<0.01$; ***: $p<0.001$) (c) A spider image that time-limited humans frequently perceived as a snake (top parentheses: number of subjects tested on this image). Right: accuracy on this adversarial image when presented briefly compared to when presented for a long time (the long-presentation result is based on a post-experiment email survey of 13 participants).
  • Figure 4: Adversarial images affect human response time. (a) Average response time to false images. (b) Average response time for adv, image, and flip conditions (error bars $\pm$ SE; * reflects $p<0.05$; two-sample two-tailed t-test). In all three stimulus groups, there was a trend towards slower response times in the adv condition than in either control group. (c) Probability of choosing the adversarially targeted class in the false condition, estimated by averaging the responses of all subjects (two-tailed t-test relative to chance level $0.5$; error bars $\pm$ SE; *: $p<0.05$; **: $p<0.01$; ***: $p<0.001$). The probability of choosing the targeted label is computed by binning trials within percentile reaction time ranges (0-33 percentile, 33-67 percentile, and 67-100 percentile). The bias relative to the chance level of 0.5 is significant when people reported their decision quickly (when they may have been more confident), but not significant when they reported their decision more slowly. As discussed in Section \ref{transfer to humans}, the differing effect directions in (b) and (c) may be explained by adversarial perturbations decreasing decision confidence in the adv condition and increasing decision confidence in the false condition.
  • Figure 5: Examples of the types of manipulations performed by the adversarial attack. See Figures \ref{fig advx pets} through \ref{fig advx veg} for additional examples of adversarial images. Also see Figure \ref{fig advx false} for adversarial examples from the false condition.
  • ...and 5 more figures
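
The statistics named in the captions of Figures 3 and 4 (a t-test of the targeted-class choice rate against the chance level of 0.5, and binning trials by reaction-time percentile) can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names, the `(reaction_time, chose_target)` trial format, and the index-based tercile split are assumptions. Only the t statistic is computed; converting it to a p-value would need a t-distribution CDF, which is left out here.

```python
import math
import statistics


def t_stat_vs_chance(values, chance=0.5):
    """One-sample t statistic of the mean of `values` against `chance`.

    `values` would typically hold one choice rate per subject.
    """
    n = len(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return (statistics.mean(values) - chance) / standard_error


def targeted_rate_by_rt_tercile(trials):
    """Fraction of targeted-class choices in each reaction-time tercile.

    trials: list of (reaction_time, chose_target) pairs, with
            chose_target equal to 1 if the subject picked the
            adversarially targeted class and 0 otherwise.
    Returns [fastest_third, middle_third, slowest_third].
    Needs at least three trials so that no bin is empty.
    """
    ordered = sorted(trials, key=lambda trial: trial[0])
    n = len(ordered)
    cuts = [0, n // 3, (2 * n) // 3, n]
    rates = []
    for start, stop in zip(cuts[:-1], cuts[1:]):
        choices = [chose for _, chose in ordered[start:stop]]
        rates.append(sum(choices) / len(choices))
    return rates
```

A rate above 0.5 in the fastest bin together with rates near 0.5 in the slower bins would match the pattern described in the Figure 4(c) caption.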