Recurrent Models of Visual Attention

Volodymyr Mnih, Nicolas Heess, Alex Graves, Koray Kavukcuoglu

TL;DR

RAM (the Recurrent Attention Model) is a recurrent, attention-based model that processes only selected image regions through a retina-like glimpse, reducing computation while maintaining accuracy on cluttered tasks. It treats visual processing as a POMDP and trains via policy gradient, optionally augmented with supervised signals, to learn where to look and what action to take. Empirically, RAM outperforms comparable baselines on cluttered recognition and learns a control policy in a dynamic environment, which highlights the practical value of task-driven visual attention. The approach offers scalable, flexible perception modules for static and dynamic settings and suggests extensions such as stopping decisions and multi-scale sensing.

Abstract

Applying convolutional neural networks to large images is computationally expensive because the amount of computation scales linearly with the number of image pixels. We present a novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Like convolutional neural networks, the proposed model has a degree of translation invariance built-in, but the amount of computation it performs can be controlled independently of the input image size. While the model is non-differentiable, it can be trained using reinforcement learning methods to learn task-specific policies. We evaluate our model on several image classification tasks, where it significantly outperforms a convolutional neural network baseline on cluttered images, and on a dynamic visual control problem, where it learns to track a simple object without an explicit training signal for doing so.
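Because choosing a glimpse location is a sampling step, gradients cannot flow through it by ordinary backpropagation. A score-function (REINFORCE-style) estimator is the standard policy-gradient workaround. The block below is a sketch in assumed notation, not the paper's exact equation: $\pi_\theta$ is the stochastic location policy, $h_t$ the recurrent state, $l_t$ the sampled location, $R$ the episode return (for classification, for example, 1 if the final label is correct and 0 otherwise), $b$ a constant baseline used to reduce variance, and $M$ the number of sampled episodes.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=1}^{T}
      \nabla_\theta \log \pi_\theta(l_t \mid h_t)\,(R - b)\right]
  \approx \frac{1}{M}\sum_{i=1}^{M}\sum_{t=1}^{T}
      \nabla_\theta \log \pi_\theta\!\left(l_t^{i} \mid h_t^{i}\right)\left(R^{i} - b\right)
```

Subtracting $b$ leaves the expectation unchanged, because $\mathbb{E}[\nabla_\theta \log \pi_\theta] = 0$, and typically lowers the variance of the Monte Carlo estimate.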

Paper Structure

This paper contains 9 sections, 2 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: A) Glimpse Sensor: Given the coordinates of the glimpse and an input image, the sensor extracts a retina-like representation $\rho(x_t, l_{t-1})$ centered at $l_{t-1}$ that contains patches at multiple resolutions. B) Glimpse Network: Given the location $l_{t-1}$ and the input image $x_t$, the network uses the glimpse sensor to extract the retina representation $\rho(x_t,l_{t-1})$. The retina representation and the glimpse location are then mapped into a hidden space by independent linear layers with rectified units, parameterized by $\theta_g^0$ and $\theta_g^1$ respectively, followed by another linear layer $\theta_g^2$ that combines the information from both components. The glimpse network $f_g(\cdot;\{\theta_g^0,\theta_g^1,\theta_g^2\})$ defines a trainable, bandwidth-limited sensor for the attention network and produces the glimpse representation $g_t$. C) Model Architecture: Overall, the model is an RNN. The core network $f_h(\cdot;\theta_h)$ takes the glimpse representation $g_t$ as input and combines it with the internal representation at the previous time step, $h_{t-1}$, to produce the new internal state $h_t$. The location network $f_l(\cdot;\theta_l)$ and the action network $f_a(\cdot;\theta_a)$ use the internal state $h_t$ to produce the next location to attend to, $l_{t}$, and the action/classification $a_t$, respectively. This basic RNN iteration is repeated for a variable number of steps.
  • Figure 2: Examples of test cases for the Translated and Cluttered Translated MNIST tasks.
  • Figure 3: Examples of the learned policy on $60\times60$ cluttered-translated MNIST task. Column 1: The input image with glimpse path overlaid in green. Columns 2-7: The six glimpses the network chooses. The center of each image shows the full resolution glimpse, the outer low resolution areas are obtained by upscaling the low resolution glimpses back to full image size. The glimpse paths clearly show that the learned policy avoids computation in empty or noisy parts of the input space and directly explores the area around the object of interest.
  • Figure 4: Examples of the learned policy on $60\times60$ cluttered-translated MNIST task. Column 1: The input image from the MNIST test set with the glimpse path overlaid in green (correctly classified) or red (incorrectly classified). Columns 2-7: The six glimpses the network chooses. The center of each image shows the full-resolution glimpse; the outer low-resolution areas are obtained by upscaling the low-resolution glimpses back to full image size. The glimpse paths clearly show that the learned policy avoids computation in empty or noisy parts of the input space and directly explores the area around the object of interest.
  • Figure 5: Examples of the learned policy on $60\times60$ cluttered-translated MNIST task. Column 1: The input image from the MNIST test set with the glimpse path overlaid in green (correctly classified) or red (incorrectly classified). Columns 2-7: The six glimpses the network chooses. The center of each image shows the full-resolution glimpse; the outer low-resolution areas are obtained by upscaling the low-resolution glimpses back to full image size. The glimpse paths clearly show that the learned policy avoids computation in empty or noisy parts of the input space and directly explores the area around the object of interest.
  • ...and 1 more figure
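
The pipeline in the Figure 1 caption (glimpse sensor, glimpse network, core network, location and action networks) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `glimpse_sensor`, `TinyRAM`, and `run_glimpses` are hypothetical, as are the layer sizes, the random untrained weights, the `(row, col)` coordinate order, the average-pool downsampling, the concatenation before the combining layer, the `tanh` squashing of the location mean, and the fixed Gaussian standard deviation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def glimpse_sensor(image, loc, patch_size=8, num_scales=3):
    """Retina-like glimpse: num_scales square patches centered at loc.

    Each patch is twice as wide as the previous one and is average-pooled
    down to patch_size x patch_size. loc = (row, col) in [-1, 1], with
    (0, 0) the image center (assumed convention). Regions outside the
    image are zero-padded.
    """
    h, w = image.shape
    cy = int(round((loc[0] + 1.0) / 2.0 * (h - 1)))
    cx = int(round((loc[1] + 1.0) / 2.0 * (w - 1)))
    pad = patch_size * 2 ** (num_scales - 1) // 2
    padded = np.pad(image, pad, mode="constant")
    patches = []
    for s in range(num_scales):
        f = 2 ** s                      # downsampling factor at this scale
        size = patch_size * f
        top, left = cy + pad - size // 2, cx + pad - size // 2
        patch = padded[top:top + size, left:left + size]
        # Average-pool f x f blocks so every scale has the same shape.
        patches.append(patch.reshape(patch_size, f, patch_size, f).mean(axis=(1, 3)))
    return np.stack(patches)            # shape: (num_scales, patch_size, patch_size)

class TinyRAM:
    """Untrained forward pass only; all sizes are illustrative assumptions."""

    def __init__(self, patch_size=8, num_scales=3, hidden=32, num_classes=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.patch_size, self.num_scales, self.hidden = patch_size, num_scales, hidden
        rho_dim = num_scales * patch_size * patch_size
        init = lambda *shape: self.rng.normal(0.0, 0.1, size=shape)
        self.W_g0 = init(hidden, rho_dim)     # theta_g^0: retina -> hidden
        self.W_g1 = init(hidden, 2)           # theta_g^1: location -> hidden
        self.W_g2 = init(hidden, 2 * hidden)  # theta_g^2: combine both
        self.W_hh = init(hidden, hidden)      # theta_h: recurrent part
        self.W_hg = init(hidden, hidden)      # theta_h: glimpse input part
        self.W_l = init(2, hidden)            # theta_l: location network
        self.W_a = init(num_classes, hidden)  # theta_a: action network

    def step(self, image, loc, h_prev, loc_std=0.1):
        rho = glimpse_sensor(image, loc, self.patch_size, self.num_scales).ravel()
        h_rho = relu(self.W_g0 @ rho)
        h_loc = relu(self.W_g1 @ np.asarray(loc, dtype=float))
        g = relu(self.W_g2 @ np.concatenate([h_rho, h_loc]))   # glimpse g_t
        h = relu(self.W_hh @ h_prev + self.W_hg @ g)           # core state h_t
        mean = np.tanh(self.W_l @ h)
        # Stochastic location policy: Gaussian around the predicted mean.
        next_loc = np.clip(self.rng.normal(mean, loc_std), -1.0, 1.0)
        logits = self.W_a @ h
        probs = np.exp(logits - logits.max())
        return h, next_loc, probs / probs.sum()

def run_glimpses(model, image, num_glimpses=6):
    """Roll the RNN for a fixed number of glimpses, starting at the center."""
    h, loc, path = np.zeros(model.hidden), np.zeros(2), []
    for _ in range(num_glimpses):
        path.append(loc.copy())
        h, loc, probs = model.step(image, loc, h)
    return probs, np.array(path)
```

Calling `run_glimpses(TinyRAM(), image)` on a 60×60 array returns class probabilities from the final step and the six attended locations. The cost per step depends on `patch_size` and `num_scales` rather than on the image size, which is the property the abstract emphasizes.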