Human Eyes Inspired Recurrent Neural Networks are More Robust Against Adversarial Noises

Minkyu Choi, Yizhen Zhang, Kuan Han, Xiaokai Wang, Zhongming Liu

TL;DR

A dual-stream vision model inspired by the human brain learns to attend and gaze in ways similar to humans without being explicitly trained to mimic human attention, and its retinal sampling and recurrent processing make it more robust against adversarial attacks.

Abstract

Humans actively observe the visual surroundings by focusing on salient objects and ignoring trivial details. However, computer vision models based on convolutional neural networks (CNNs) often analyze visual input all at once through a single feed-forward pass. In this study, we designed a dual-stream vision model inspired by the human brain. This model features retina-like input layers and includes two streams: one determines the next point of focus (the fixation), while the other interprets the visuals surrounding the fixation. Trained on image recognition, this model examines an image through a sequence of fixations, each time focusing on different parts, thereby progressively building a representation of the image. We evaluated this model against various benchmarks in terms of object recognition, gaze behavior, and adversarial robustness. Our findings suggest that the model can attend and gaze in ways similar to humans without being explicitly trained to mimic human attention, and that the model can enhance robustness against adversarial attacks due to its retinal sampling and recurrent processing. In particular, the model can correct its perceptual errors by taking more glances, setting itself apart from all feed-forward-only models. In conclusion, the interactions of retinal sampling, eye movement, and recurrent dynamics are important to human-like visual exploration and inference.

Paper Structure (21 sections, 6 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Recurrent models for eye movement and visual recognition. (a) The general two-stream model architecture: The model includes two streams - the dorsal (left) and ventral (right) streams. The dorsal stream has a broad field of view and generates a fixation point ($l_{t}$) at a step (or glance). The ventral stream takes selective samples around the fixation, extracts a representation, and accumulates the representations across multiple glances for object recognition ($p_t$). Using this general scheme, we design and test three different implementations of the ventral stream, namely Crop-S, Crop-D, and Retina, illustrated in (b) through (d). (b) Crop-S crops a small image region (shown as the red box) around the fixation. (c) Crop-D crops two regions (shown as the red and blue boxes) around the fixation and samples them with different resolutions such that the same number of samples is extracted from either region. (d) Retina applies retinal transformation and extracts non-uniform samples with respect to the fixation. Those models use the same architecture for the dorsal stream, for which the weights are learned separately alongside the different ventral stream models.
  • Figure 2: Examples of the retinal transformation. Left: original image. Right: retinal sampling grid and resulting retinal images.
  • Figure 3: Details of the Attention module in the dorsal stream. The Attention module produces a 2D saliency map and a fixation point ($l_{t}$) at time step $t$. Intermediate representations from each module are shown on the right.
  • Figure 4: Saliency maps from top to bottom: humans, Retina, Crop-D, Crop-S, S3TA, and FF-CNN. Regions of higher saliency are highlighted in red, while areas of lower saliency are depicted in blue.
  • Figure 5: (a) Learned fixations from the first four steps. The first through fourth fixations are marked in red, blue, green, and black, respectively. The first column presents exemplar attention shifts. (b) Visualization of the retinal transformation as the object is shifted from the fovea to the periphery. (c) Examples of the retinal transformation as the hyperparameter $b$ changes from $8$ to $16$.
  • ...and 3 more figures
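
The Crop-S and Crop-D sampling schemes described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`crop_patch`, `crop_s`, `crop_d`), the zero padding at image borders, the block-average downsampling, the patch sizes, and the restriction to 2D grayscale arrays are all assumptions made for illustration. The only property taken from the caption is that Crop-D extracts two regions around the fixation and samples them at different resolutions so that both yield the same number of samples.

```python
import numpy as np

def crop_patch(image, center, size):
    """Crop a size x size window around `center` = (row, col).

    Assumption: pixels outside the image are treated as zeros.
    """
    half = size // 2
    padded = np.pad(image, ((half, half), (half, half)), mode="constant")
    r, c = center
    # After padding by `half`, original pixel (r, c) sits at (r + half, c + half),
    # so the window starting at (r, c) in padded coordinates contains it
    # at local index (half, half).
    return padded[r:r + size, c:c + size]

def crop_s(image, center, size=8):
    """Crop-S (sketch): a single small region around the fixation."""
    return crop_patch(image, center, size)

def crop_d(image, center, n=8, factor=4):
    """Crop-D (sketch): a narrow and a wide region, both reduced to n x n samples.

    Assumption: the wide region is downsampled by averaging
    non-overlapping factor x factor blocks.
    """
    fine = crop_patch(image, center, n)            # narrow view, full resolution
    wide = crop_patch(image, center, n * factor)   # wide view, full resolution
    coarse = wide.reshape(n, factor, n, factor).mean(axis=(1, 3))
    return fine, coarse
```

For a 64 x 64 image and a fixation at `(32, 32)`, `crop_d(image, (32, 32))` returns two 8 x 8 arrays: one covering an 8 x 8 neighborhood at full resolution and one covering a 32 x 32 neighborhood at one quarter of the resolution. The Retina variant replaces these rectangular crops with a non-uniform sampling grid, which this sketch does not attempt to reproduce.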