Biologically inspired protection of deep networks from adversarial attacks

Aran Nayebi, Surya Ganguli

TL;DR

This work addresses the vulnerability of deep networks to imperceptible adversarial perturbations with a biologically inspired approach: training networks to operate in a highly nonlinear, saturated regime without ever exposing them to adversarial examples. A saturating penalty, applied with an annealing schedule, drives activations into saturation, improving robustness against gradient-based and iterative attacks while maintaining standard accuracy. The authors provide mechanistic insight via information geometry, showing flat input–output mappings and strongly separated class clusters, and identify high weight kurtosis, reminiscent of weight statistics in the brain, as a contributor to robustness even in linear classifiers. Together, these results offer a biologically plausible and theoretically grounded route to intrinsic adversarial robustness, with implications for both AI and neuroscience.
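To make the idea of an annealed saturating penalty concrete, here is a minimal sketch in plain Python. It is not the authors' formulation: the penalty form (distance of a sigmoid activation to its nearest saturated value, 0 or 1), the linear ramp of the penalty weight, and the names `saturation_penalty`, `annealed_lambda`, and `total_loss` are all assumptions made for illustration.

```python
# Hypothetical sketch: this summary does not give the paper's exact penalty
# or annealing schedule, so both are assumed here.

def saturation_penalty(activations):
    """Mean distance of each sigmoid activation to its nearest saturated value.

    Assumes activations lie in [0, 1]. The penalty is 0 when every unit
    sits at 0 or 1 and reaches its maximum (0.5) when every unit sits at 0.5.
    """
    return sum(min(h, 1.0 - h) for h in activations) / len(activations)


def annealed_lambda(epoch, total_epochs, lam_max):
    """Linearly ramp the penalty weight from 0 to lam_max (assumed schedule)."""
    if total_epochs <= 1:
        return lam_max
    return lam_max * min(1.0, epoch / (total_epochs - 1))


def total_loss(task_loss, activations, epoch, total_epochs, lam_max=1.0):
    """Task loss plus the annealed saturation penalty."""
    lam = annealed_lambda(epoch, total_epochs, lam_max)
    return task_loss + lam * saturation_penalty(activations)


if __name__ == "__main__":
    hidden = [0.02, 0.5, 0.97]  # example sigmoid activations
    for epoch in (0, 5, 10):
        print(epoch, total_loss(0.3, hidden, epoch, total_epochs=11))
```

Under these assumptions the penalty is inactive at the start of training and gradually gains weight, so the network can first fit the task and is then pushed toward saturated activations.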

Abstract

Inspired by biophysical principles underlying nonlinear dendritic computation in neural circuits, we develop a scheme to train deep neural networks to make them robust to adversarial attacks. Our scheme generates highly nonlinear, saturated neural networks that achieve state-of-the-art performance on gradient-based adversarial examples on MNIST, despite never being exposed to adversarially chosen examples during training. Moreover, these networks exhibit unprecedented robustness to targeted, iterative schemes for generating adversarial examples, including second-order methods. We further identify principles governing how these networks achieve their robustness, drawing on methods from information geometry. We find these networks progressively create highly flat and compressed internal representations that are sensitive to very few input dimensions, while still solving the task. Moreover, they employ highly kurtotic weight distributions, also found in the brain, and we demonstrate how such kurtosis can protect even linear classifiers from adversarial attack.
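The kurtosis referred to above can be made concrete with the standard definition of excess kurtosis, $\gamma_2 = \mathbb{E}[(w-\mu)^4]/\sigma^4 - 3$, which is zero for a Gaussian and large for distributions with a few large values among many small ones. The sketch below is a plain-Python illustration using the biased (population) estimator. The function name `excess_kurtosis` and the toy weight vectors are assumptions for illustration, not data or code from the paper.

```python
def excess_kurtosis(values):
    """Biased (population) estimate of excess kurtosis: m4 / m2**2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / (m2 ** 2) - 3.0


if __name__ == "__main__":
    # Toy weight vectors (hypothetical): all weights of equal magnitude
    # versus mostly zero weights with two large outliers.
    dense = [-1.0, 1.0] * 50
    sparse = [0.0] * 98 + [10.0, -10.0]
    print(excess_kurtosis(dense))   # -2.0 (light-tailed)
    print(excess_kurtosis(sparse))  # 47.0 (heavy-tailed, highly kurtotic)
```

The second vector mimics the qualitative pattern the abstract describes: most weights near zero and a few large ones, giving a strongly positive $\gamma_2$.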

Paper Structure

This paper contains 9 sections, 18 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Comparing the vanilla and saturating network's weights for the Sigmoid MLP (A) and ReLU MLP (B) in each layer. The excess kurtosis ($\gamma_2$) of each weight distribution is given as well.
  • Figure 2: Comparing the vanilla (first row) and saturating network's activities, $\mathbf{h}^{l}$, for the Sigmoid (second row) and ReLU MLP (third row) in each layer $l$.
  • Figure 3: Relational dissimilarity matrix for the Sigmoid MLP. For the corresponding figure for the ReLU MLP, refer to the Supplementary Material (SM).
  • Figure 4: Comparison between the vanilla (top 3 rows) and saturating Sigmoid MLP's (bottom 3 rows) length element, softmax output probabilities, and Jacobian singular values as a function of the interpolation parameter between source and target. The columns denote a particular source and target class pair: $(S: 1, T: 7), (S: 3, T: 7), (S: 6, T: 9), (S: 0, T: 6)$. A similar trend occurs for the ReLU MLP (data not shown).
  • Figure 5: Comparison between the vanilla (A) and saturating Sigmoid MLP's (B) adversarial images after 1000 iterations of L-BFGS. A red 'X' denotes that no adversarial image causing misclassification was found within 1000 iterations of L-BFGS. Each source image is the least confidently classified image of its source class. A similar trend occurs for the ReLU MLP (data not shown).
  • ...and 3 more figures