Benford's law: what does it say on adversarial images?

João G. Zago, Fabio L. Baldissera, Eric A. Antonelo, Rodrigo T. Saad

TL;DR

Benford's Law is leveraged as an input-centered detector for adversarial images: inputs are transformed with a gradient-magnitude operator, and the resulting first-digit distribution is compared to the Benford reference using the KS statistic, where the Benford distribution is $P(d)=\log_{10}(1+1/d)$ for $d\in\{1,\ldots,9\}$. The authors show that adversarial perturbations cause systematic deviations from Benford's Law, with deviations growing with attack strength, and demonstrate a practical, low-dimensional feature (FAD) based on KS deviation that detects adversarial inputs with accuracy competitive with full-image CNN detectors at far lower cost. The work highlights the potential for online monitoring and pre-attack signaling, and points to future extensions to additional attack types and more refined KS-based detection schemes. Overall, the study provides a fast, transformation-based, Benford-deviation signal that can complement existing defenses in adversarial image detection.

Abstract

Convolutional neural networks (CNNs) are fragile to small perturbations in the input images. These networks are thus prone to malicious attacks that perturb the inputs to force a misclassification. Such slightly manipulated images aimed at deceiving the classifier are known as adversarial images. In this work, we investigate statistical differences between natural images and adversarial ones. More precisely, we show that, with a suitable image transformation and for a class of adversarial attacks, the distribution of the leading digit of the pixel values in adversarial images deviates from Benford's law. The stronger the attack, the more distant the resulting distribution is from Benford's law. Our analysis provides a detailed investigation of this new approach, which can serve as a basis for alternative adversarial example detection methods that neither need to modify the original CNN classifier nor work on the raw high-dimensional pixels as features to defend against attacks.
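
The pipeline summarized in the TL;DR (transform the image, take the first-digit distribution, compare it to Benford's law with a KS statistic) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the forward-difference gradient, the function names, and the choice to compute the KS statistic as the largest gap between the two cumulative distributions over the digits 1 to 9 are all assumptions made here for clarity.

```python
import math

# Benford reference: P(d) = log10(1 + 1/d) for d = 1..9
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def gradient_magnitude(img):
    """Gradient magnitudes of a 2-D list of floats.

    Assumption: simple forward differences; the paper's exact
    gradient operator may differ.
    """
    h, w = len(img), len(img[0])
    mags = []
    for i in range(h - 1):
        for j in range(w - 1):
            gx = img[i][j + 1] - img[i][j]
            gy = img[i + 1][j] - img[i][j]
            mags.append(math.hypot(gx, gy))
    return mags


def first_digit(v):
    """Leading significant digit of a positive float, via scientific notation."""
    return int(f"{v:.16e}"[0])


def ks_to_benford(img):
    """Largest absolute gap between the empirical and Benford digit CDFs."""
    mags = [m for m in gradient_magnitude(img) if m > 0]  # zeros have no leading digit
    if not mags:
        raise ValueError("image has no non-zero gradient values")
    counts = [0] * 9
    for m in mags:
        counts[first_digit(m) - 1] += 1
    ks = cum_emp = cum_ref = 0.0
    for c, p in zip(counts, BENFORD):
        cum_emp += c / len(mags)
        cum_ref += p
        ks = max(ks, abs(cum_emp - cum_ref))
    return ks
```

Calling `ks_to_benford(img)` on a grayscale image stored as a list of rows returns a single scalar in $[0, 1]$, which is the kind of low-dimensional feature the TL;DR refers to. For real images one would typically vectorize these steps with an array library.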

Paper Structure

This paper contains 19 sections, 9 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of the proposed approach, composed of three steps: a) transformation of the image $x$, represented by the mapping $T(\cdot)$; b) a statistical analysis of $T(x)$, denoted by $\hat{P}$; and c) a comparison of $\hat{P}$ with a reference distribution $P$.
  • Figure 2: First digit distribution as in Benford's Law
  • Figure 3: Separation using the KS statistic for adversarial and clean examples. Left: each dot represents one image, attacked (in black) or unattacked (in blue). The maximum separation is achieved by the red horizontal line. Right: separation percentage of the points from the left plot for different horizontal lines that split those points linearly. The maximum is attained by the horizontal red line.
  • Figure 4: Separation between adversarial (black dots) and clean images (blue dots) increases with the magnitude of the attack's perturbation $\epsilon$. (a), (b), and (c) present the KL divergence between the FDD of the transformed images (samples from the ImageNet dataset) and the FDL by Benford's Law. The adversarial examples are generated by the $\|\cdot\|_\infty$-norm FGSM attack with $\epsilon$ equal to $0.1$, $0.2$ and $0.5$, respectively. (d) shows the mean and standard deviation of the KL divergence obtained while varying $\epsilon$ for three different datasets.
  • Figure 5: Output of the KS test (our proposal) as the $2$-norm PGD attack is formed for eleven images from the ImageNet dataset. Each trajectory represents one image that is being attacked, with the final adversarial image represented by a red dot at the end of the trajectory. The points (images) above the dashed horizontal red line can be flagged as under attack by our method.
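
The horizontal-line separation described in the captions of Figures 3 and 5 amounts to picking a single threshold on the KS score. The sketch below shows one way such a threshold could be chosen, assuming a brute-force scan over midpoints between observed scores; the function name `best_threshold` and the example scores are hypothetical and are not taken from the paper.

```python
def best_threshold(clean_scores, adv_scores):
    """Return (threshold, accuracy) for the best horizontal split.

    An image is flagged as adversarial when its score is above the
    threshold. Assumes both score lists are non-empty.
    """
    scores = sorted(set(clean_scores) | set(adv_scores))
    # Candidate lines: below all points, between neighbours, above all points.
    candidates = [scores[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(scores, scores[1:])]
    candidates.append(scores[-1] + 1.0)

    total = len(clean_scores) + len(adv_scores)
    best_t, best_acc = candidates[0], 0.0
    for t in candidates:
        correct = sum(s <= t for s in clean_scores) + sum(s > t for s in adv_scores)
        acc = correct / total
        if acc > best_acc:  # keep the lowest threshold among ties
            best_t, best_acc = t, acc
    return best_t, best_acc


# Made-up KS scores for illustration only.
clean = [0.02, 0.03, 0.05, 0.12]
adversarial = [0.10, 0.15, 0.20, 0.30]
threshold, accuracy = best_threshold(clean, adversarial)
```

With these illustrative scores, one clean image overlaps the adversarial range, so no horizontal line separates the two groups perfectly.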