Solving adversarial examples requires solving exponential misalignment

Alessandro Salvatore, Stanislav Fort, Surya Ganguli

TL;DR

The results connect the fields of alignment and adversarial examples, and suggest the curse of high dimensionality of machine PMs is a major impediment to adversarial robustness.

Abstract

Adversarial attacks (input perturbations imperceptible to humans that fool neural networks) remain both a persistent failure mode in machine learning and a phenomenon with mysterious origins. To shed light on them, we define and analyze a network's perceptual manifold (PM) for a class concept as the space of all inputs confidently assigned to that class by the network. We find, strikingly, that the dimensionalities of neural network PMs are orders of magnitude higher than those of natural human concepts. Since volume typically grows exponentially with dimension, this suggests exponential misalignment between machines and humans, with exponentially many inputs confidently assigned to concepts by machines but not by humans. Furthermore, this provides a natural geometric hypothesis for the origin of adversarial examples: because a network's PM fills such a large region of input space, any input will be very close to any class concept's PM. Our hypothesis thus suggests that adversarial robustness cannot be attained without dimensional alignment of machine and human PMs, and it therefore makes strong predictions: both robust accuracy and distance to any PM should be negatively correlated with the PM dimension. We confirm these predictions across 18 different networks of varying robust accuracy. Crucially, we find that even the most robust networks are still exponentially misaligned, and that only the few PMs whose dimensionality approaches that of human concepts exhibit alignment with human perception. Our results connect the fields of alignment and adversarial examples, and suggest that the curse of high dimensionality of machine PMs is a major impediment to adversarial robustness.
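
As a rough illustration of the step "volume typically grows exponentially with dimension", consider a region that spans $k$ distinguishable values along each of its $d$ dimensions. This is a back-of-the-envelope sketch and not the paper's own derivation: the box-shaped region and the symbol $k$ are assumptions introduced here, while $d_{\mathrm{machine}} \approx 3000$ and $d_{\mathrm{human}} \approx 20$ are the CIFAR10 values quoted in the caption of Figure 1.

$$
N(d) \sim k^{d}
\qquad\Longrightarrow\qquad
\frac{N(d_{\mathrm{machine}})}{N(d_{\mathrm{human}})} \sim k^{\,d_{\mathrm{machine}} - d_{\mathrm{human}}} = k^{2980}.
$$

Even for the smallest nontrivial choice $k = 2$, this ratio is $2^{2980} \approx 10^{897}$, since $2980 \log_{10} 2 \approx 897$. Under these assumptions, the gap in dimension alone implies an astronomically large set of inputs inside the machine PM that lie outside the human one.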

Paper Structure (43 sections, 14 equations, 33 figures, 1 table, 1 algorithm)

Figures (33)

  • Figure 1: Visualization of our main argument. We show that a network's perceptual manifold (PM) for any class concept (e.g. cat), defined as the set of all images confidently perceived by the network as that class, is extremely high dimensional: $3000$ out of a total of $3072$ dimensions for CIFAR10 (large red manifold) and $\approx 135{,}000$ out of $150{,}528$ for CLIP and ImageNet. In contrast, natural images perceived by humans as any class (e.g. dogs, airplanes, or cats, shown as the blue, green, and bright red manifolds) are only $\approx 20$ dimensional. This indicates that machine and human PMs for any concept are exponentially misaligned: there are exponentially many inputs confidently perceived as any given concept by machines, but not by humans (e.g. the two noise images in the network's cat perceptual manifold). This exponential misalignment also explains the origin of adversarial examples: because the network's cat PM fills up so much of image space, any other input (e.g. dog or airplane) is extremely close to it.
  • Figure 2: Comparison of the dimensionality (Participation Ratio on the left and Two Nearest Neighbors on the right) of the Perceptual Manifold of a WideResNet-28-10 (clean accuracy of $94.78\%$ and robust accuracy of $0\%$) with that of the natural images for each class. In the left plot, the PR of the natural images is $\approx 10$, which makes it barely visible. The arrows in the right plot indicate that those values are lower bounds. The excessive dimensionality of machine PMs relative to their natural counterparts signals exponential misalignment.
  • Figure 3: Monte Carlo estimate of the squared distance between a random point sampled from the unit hypercube and the boundary of a $d$-dimensional ellipsoid. The lengths of the principal axes are sampled from a uniform distribution $\mathcal{U}[6,30]$, so that the volume of the full $3072$-dimensional ellipsoid is roughly equal to the estimated volume of the non-robust model's PM (when approximating it by an ellipsoid). Error bars cover the $\pm 1\sigma$ interval. The red dashed curve is the theoretical prediction.
  • Figure 4: Dimensionality comparison (Participation Ratio and Two Nearest Neighbors) of the CLIP Perceptual Manifold versus natural images (LSUN dataset). Arrows indicate that the reported values are lower bounds on the actual values; plots of how the predicted PR and 2NN scale with dataset size appear in the CLIP PR scaling and CLIP 2NN scaling figures. We include a "Gibberish" control prompt: "kjdbfkw hsafj asjf gjkbg". Note that the natural dimensionality of the gibberish class is undefined (N/A), as no natural images correspond to it.
  • Figure 5: Representative samples from CLIP's Perceptual Manifold for valid descriptions (e.g., "a photo of a bedroom") and the control prompt "kjdbfkw hsafj asjf gjkbg". Visually, all samples appear as noise, again indicating exponential misalignment between machine and human perception.
  • ...and 28 more figures
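
The captions of Figures 2 and 4 report dimensionality using the Participation Ratio (PR) and a Two Nearest Neighbors (2NN) estimate. The sketch below shows one common way to compute both from a sample matrix. It is an illustration under stated assumptions and not the authors' code: the function names are hypothetical, the PR is taken as $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$ over the covariance eigenvalues $\lambda_i$, and the 2NN estimate uses the maximum-likelihood form $d = N / \sum_i \log(r_{2,i}/r_{1,i})$, which may differ from the fitting procedure used in the paper.

```python
import numpy as np


def participation_ratio(X):
    """Participation ratio of the rows of X (shape: n_samples x n_features).

    Computes (sum_i lam_i)^2 / sum_i lam_i^2, where lam_i are the eigenvalues
    of the sample covariance. The 1/(n-1) normalisation cancels in the ratio,
    so squared singular values of the centred data are used directly.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    s = np.linalg.svd(Xc, compute_uv=False)
    lam = s ** 2
    return lam.sum() ** 2 / (lam ** 2).sum()


def two_nn_dimension(X):
    """Maximum-likelihood Two Nearest Neighbors estimate (an assumed variant).

    For each point, r1 and r2 are the distances to its first and second
    nearest neighbours. The estimate is N / sum_i log(r2_i / r1_i).
    Builds a dense n x n distance matrix, so it only suits modest n.
    """
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared distances
    d2 = np.maximum(d2, 0.0)          # clip tiny negatives from round-off
    np.fill_diagonal(d2, np.inf)      # exclude each point from its own neighbours
    two_smallest = np.partition(d2, 1, axis=1)[:, :2]
    r1_sq, r2_sq = two_smallest[:, 0], two_smallest[:, 1]
    valid = r1_sq > 0                 # drop exact duplicates (r1 = 0)
    log_mu = 0.5 * np.log(r2_sq[valid] / r1_sq[valid])
    return valid.sum() / log_mu.sum()


def make_embedded_gaussian(n=1000, intrinsic_dim=5, ambient_dim=50, seed=0):
    """Toy data: an isotropic Gaussian of low dimension, linearly embedded."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, intrinsic_dim))
    Q, _ = np.linalg.qr(rng.standard_normal((ambient_dim, intrinsic_dim)))
    return Z @ Q.T                    # shape (n, ambient_dim)


if __name__ == "__main__":
    X = make_embedded_gaussian()
    print("PR  estimate:", participation_ratio(X))
    print("2NN estimate:", two_nn_dimension(X))
```

On the toy data, both estimates should land near the intrinsic dimension of 5 rather than the ambient dimension of 50. The PR is a linear measure and depends on the covariance spectrum, whereas the 2NN estimate is local and can also track curved manifolds, which may be why the captions report both.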