On Inherent Adversarial Robustness of Active Vision Systems

Amitangshu Mukherjee, Timur Ibrayev, Kaushik Roy

TL;DR

Deep neural networks remain vulnerable to adversarial perturbations, whereas human vision is robust to them, plausibly because it relies on saccades and foveation. The authors study two existing active vision methods, GFNet and FALcon, which process an image through downsampled glimpses taken from multiple fixation points, and demonstrate their inherent robustness under a black-box threat model. Across ImageNet evaluations and multiple iterative attacks, GFNet and FALcon achieve 2-3x higher accuracy under attack than a passive baseline. Interpretable visualizations, such as Initial Fixation Point Maps (IFPM) and occlusion analyses, illustrate why. The work suggests that active, biologically inspired processing can improve robustness, and it can inform future defense strategies and bio-inspired robustness research.

Abstract

Current Deep Neural Networks are vulnerable to adversarial examples, which alter their predictions by adding carefully crafted noise. Since human eyes are robust to such inputs, it is possible that the vulnerability stems from the standard way of processing inputs in one shot, treating every pixel with the same importance. In contrast, neuroscience suggests that the human vision system can differentiate salient features by (1) switching between multiple fixation points (saccades) and (2) processing the surroundings with a non-uniform external resolution (foveation). In this work, we advocate that the integration of such active vision mechanisms into current deep learning systems can offer robustness benefits. Specifically, we empirically demonstrate the inherent robustness of two active vision methods, GFNet and FALcon, under a black-box threat model. By learning and performing inference on downsampled glimpses obtained from multiple distinct fixation points within an input, we show that these active methods achieve 2-3 times greater robustness compared to a standard passive convolutional network under state-of-the-art adversarial attacks. More importantly, we provide illustrative and interpretable visualization analysis that demonstrates how performing inference from distinct fixation points makes active vision methods less vulnerable to malicious inputs.

Paper Structure (15 sections, 2 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: The figure illustrates two methods of processing an adversarial input: a passive method and an active method. The highlighted yellow box serves as a visual illustration of an adversarial sticker. Left column: One probable cause of passive methods being susceptible to adversarial inputs is uniform processing, that is, processing each pixel with the same importance. Middle column: In contrast, the active vision methods (A) GFNet and (B) FALcon learn the salient features of an object by observing it from multiple distinct fixation points via a sequence of glimpses, as indicated by the blue boxes. Right column: As a result, inference yields several distinct predictions for the same image, not all of which are affected by the adversarial noise.
  • Figure 2: (Learning & Inference) The figure provides an overview of GFNet's operation. It begins by downsampling the input image to a lower resolution for a rapid prediction $(p_{1})$; this step is termed $f_{G}$ (Glance) and occurs at $t=1$. If the network lacks confidence $(p_{1} < \eta_{1})$, it enters subsequent $f_{L}$ (Focus) steps until certainty is attained or until $t=4$. Each focus step analyzes a patch $(H' \times W')$ cropped from the original input $(H \times W)$ and centered around $(c_{t})$, illustrated by colored dots. These coordinates are determined by a patch proposal network $\pi$. The process is depicted for a sequence length of $4$.
  • Figure 3: The figure provides a high-level overview of FALcon. During Learning, a Localizer network $(f_{L})$ is trained to predict five distinct actions (four for expansion and one for switching), enabling it to learn the importance of each fixation point, illustrated by colored dots. Learning occurs at a downsampled resolution of $(H' \times W')$. During Inference, $f_{L}$ starts from each of multiple pre-defined fixation points (20 red dots). If salient object features are present, $f_{L}$ performs the learned expansions to capture the object (4 colored dashed boxes, colored dots). The most confident final foveated glimpse (red solid box) is cropped to $(H' \times W')$ and presented to the classifier $(f_{C})$.
  • Figure 4: The figure uses Initial Fixation Point Maps (IFPMs) to show the efficacy of performing inference from multiple fixation points. An IFPM is a visual representation of the spatial locations of FALcon's initial starting positions. (b) illustrates all initial fixation points, obtained via gridding, for both clean and adversarial inputs. (c & d) show the potential and evaluated initial fixation points for a clean sample. Similarly, (e & f) show the same for an adversarial sample. An evaluated IFPM can consist of both correct and incorrect points as denoted by 2f. Adversarial noise spreads non-uniformly across an image and affects different initial points differently. This is indicated by the reduced number of potential (c to e) and correct (d to f) points from a clean to an adversarial sample. Still, the presence of a positive number of correct points (f) underscores the inherent robustness of an active method.
  • Figure 5: Precision of predictions
  • ...and 1 more figure
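
The glance-then-focus control flow described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not GFNet's actual implementation: the names `downsample`, `crop_centered`, and `glance_and_focus` are hypothetical, `classify` and `propose_center` are placeholder callables standing in for the trained networks ($f_{G}$/$f_{L}$ and $\pi$), the image is a plain nested list, and each step is classified independently, whereas the real model may carry information across steps.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]        # H x W grayscale image as nested lists
Prediction = Tuple[int, float]   # (class label, confidence in [0, 1])


def downsample(image: Image, out_h: int, out_w: int) -> Image:
    """Nearest-neighbour resize of the whole image to out_h x out_w."""
    h, w = len(image), len(image[0])
    return [[image[r * h // out_h][c * w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def crop_centered(image: Image, center: Tuple[int, int],
                  out_h: int, out_w: int) -> Image:
    """Crop an out_h x out_w patch around `center`, shifted to stay in bounds."""
    h, w = len(image), len(image[0])
    top = min(max(center[0] - out_h // 2, 0), h - out_h)
    left = min(max(center[1] - out_w // 2, 0), w - out_w)
    return [row[left:left + out_w] for row in image[top:top + out_h]]


def glance_and_focus(
    image: Image,
    classify: Callable[[Image], Prediction],                 # placeholder for f_G / f_L
    propose_center: Callable[[Image, int], Tuple[int, int]], # placeholder for pi
    thresholds: Sequence[float],                             # eta_1 .. eta_{T-1}
    patch_hw: Tuple[int, int],                               # (H', W')
    max_steps: int = 4,
) -> Tuple[int, float, int]:
    """Return (label, confidence, step at which inference stopped)."""
    out_h, out_w = patch_hw
    # t = 1 (Glance): classify a low-resolution copy of the full image.
    label, conf = classify(downsample(image, out_h, out_w))
    step = 1
    # t >= 2 (Focus): keep looking at full-resolution patches while unsure.
    while step < max_steps and conf < thresholds[step - 1]:
        step += 1
        center = propose_center(image, step)  # fixation point c_t
        label, conf = classify(crop_centered(image, center, out_h, out_w))
    return label, conf, step
```

A caller supplies one threshold per step except the last, since the final step always returns its prediction. Because each Focus step sees a different crop, a localized perturbation does not necessarily fall inside every glimpse, which is the intuition the captions above describe.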