DeepGaze II: Reading fixations from deep features trained on object recognition

Matthias Kümmerer, Thomas S. A. Wallis, Matthias Bethge

TL;DR

The paper addresses the prediction of human fixations in free-viewing images and introduces DeepGaze II, which feeds fixed VGG-19 features into a compact readout network to produce a saliency density p(x,y|I) within a probabilistic, log-likelihood framework. The readout network is pretrained on SALICON and then fine-tuned on MIT1003 with 10-fold cross-validation over images. Under this cross-validation the model explains about 87% of the explainable information gain, and a mixture of the ten fine-tuned models achieves top performance in the AUC metrics of the held-out MIT300 benchmark. The key contributions are demonstrating strong transfer learning from object-recognition features to saliency, showing that the feature extractor does not need to be retrained to reach this performance, and providing both qualitative and quantitative analyses of where the model succeeds and where it falters. The results support the view that deep features trained on object recognition are a versatile feature space for related visual tasks, and a public web service is offered for generating predictions.

Abstract

Here we present DeepGaze II, a model that predicts where people look in images. The model uses the features from the VGG-19 deep neural network trained to identify objects in images. Contrary to other saliency models that use deep features, here we use the VGG features for saliency prediction with no additional fine-tuning (rather, a few readout layers are trained on top of the VGG features to predict saliency). The model is therefore a strong test of transfer learning. After conservative cross-validation, DeepGaze II explains about 87% of the explainable information gain in the patterns of fixations and achieves top performance in area under the curve metrics on the MIT300 hold-out benchmark. These results corroborate the finding from DeepGaze I (which explained 56% of the explainable information gain), that deep features trained on object recognition provide a versatile feature space for performing related visual tasks. We explore the factors that contribute to this success and present several informative image examples. A web service is available to compute model predictions at http://deepgaze.bethgelab.org.
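The "explainable information gain" figure quoted above can be made concrete with a small sketch. It assumes that each model assigns a probability density over pixels, that performance is the mean log-likelihood of the observed fixations, and that the percentage is the model's gain over a baseline divided by the gold standard's gain over the same baseline. The function names and the toy numbers are illustrative only and are not taken from the paper.

```python
import numpy as np

def mean_log_likelihood(log_density, fixations):
    """Mean log-probability (in bits) assigned to the fixated pixels.

    log_density: 2D array of natural-log probabilities (exp sums to 1).
    fixations:   sequence of (row, col) pixel coordinates.
    """
    rows, cols = zip(*fixations)
    nats = log_density[list(rows), list(cols)]
    return float(np.mean(nats) / np.log(2))  # convert nats to bits

def information_gain_explained(ll_model, ll_baseline, ll_gold):
    """Fraction of the gold standard's gain over the baseline that the
    model achieves. The unit of the log-likelihoods cancels in the ratio."""
    return (ll_model - ll_baseline) / (ll_gold - ll_baseline)

# Toy check: a uniform density over a 4 x 4 image gives -4 bits per fixation.
uniform = np.full((4, 4), np.log(1.0 / 16.0))
ll_uniform = mean_log_likelihood(uniform, [(0, 0), (1, 2)])

# Hypothetical log-likelihoods, chosen only to illustrate the ratio.
fraction = information_gain_explained(ll_model=1.5, ll_baseline=1.0, ll_gold=2.0)
print(ll_uniform, fraction)
```

In this reading, a model that matches the baseline scores 0, a model that matches the gold standard scores 1, and a negative value means the model predicts fixations worse than the baseline does.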

Paper Structure

This paper contains 10 sections, 6 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: The architecture of DeepGaze II. The activations of a subset of the VGG feature maps for a given image are passed to a second neural network (the readout network) consisting of four layers of $1 \times 1$ convolutions. The parameters of VGG are held fixed through training (only the readout network learns about saliency prediction). This results in a final saliency map, which is then blurred, combined with a centre bias and converted into a probability distribution by means of a softmax.
  • Figure 2: Training and cross-validation procedure of the readout network used for DeepGaze II. In the pretraining phase, the model is trained on the 10000 images of the SALICON dataset using the 1003 images from the MIT1003 as a stopping criterion. In the fine-tuning phase, ten models are trained (starting from the pretrained model), each on 90% of the MIT1003 data for training and a unique 10% for stopping (10-fold cross-validation). In our evaluation (reported below), for each image we use the model predictions from the model that did not use that image in training. The final model evaluation is performed via the MIT benchmark on the held-out MIT300 dataset, based on a mixture of the ten models from the fine-tuning stage.
  • Figure 3: Model performance (information gain explained as a percentage of the gold standard model's information gain relative to the baseline model) for a selection of models from the MIT Benchmark, DeepGaze I and DeepGaze II. The eDN model (state-of-the-art in 2014) explains 34% of the explainable information gain, and DeepGaze I explains 56%. DeepGaze II achieves a substantial improvement over DeepGaze I, explaining 87% of the explainable information in the evaluation set.
  • Figure 4: Gold standard information gain against model information gain relative to the baseline model, for AIM, eDN, DeepGaze I and DeepGaze II. Each point is an image in the subset of the MIT1003 dataset used for evaluation. DeepGaze II is highly correlated with the gold standard, and is the only model for which no images show negative information gain (i.e. for which the model's prediction is worse than the pure centre bias).
  • Figure 5: The three images for which DeepGaze II had the highest information gain explained. For each unique image, the leftmost column shows the image itself (top) and the empirical fixations (bottom). The remaining columns show model predictions for the gold standard model, DeepGaze II, DeepGaze I and the eDN model respectively. The top row visualises probability densities, in which contour lines divide the images into four regions, each of which is expected to receive equal numbers of fixations. The bottom row shows fixations sampled from the model (see text for details). Sampled fixations can be compared to the empirical fixations to gain additional insight into model performance.
  • ...and 4 more figures
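
The readout stage described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: random arrays stand in for the fixed VGG feature maps, the channel widths, the ReLU nonlinearity and the blur width are placeholders, and the centre bias is assumed to be added in log space before the softmax. All function names are hypothetical.

```python
import numpy as np

def conv1x1(x, w, b):
    """A 1x1 convolution is a per-pixel linear map over channels.
    x: (H, W, C_in), w: (C_in, C_out), b: (C_out,)."""
    return x @ w + b

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with zero padding (kernel must be
    shorter than each image side)."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    k /= k.sum()
    img = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    img = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, img)
    return img

def readout_log_density(features, weights, biases, centre_bias_log, sigma=1.0):
    """Map fixed feature maps to a log-probability map over pixels."""
    x = features
    for i, (w, b) in enumerate(zip(weights, biases)):
        x = conv1x1(x, w, b)
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)  # assumed nonlinearity between layers
    saliency = gaussian_blur(x[..., 0], sigma)  # final layer has one channel
    logits = saliency + centre_bias_log         # assumed additive combination
    logits = logits - logits.max()              # numerical stability
    return logits - np.log(np.exp(logits).sum())  # log-softmax over all pixels

rng = np.random.default_rng(0)
H, W, C = 16, 16, 8
features = rng.normal(size=(H, W, C))  # stand-in for fixed VGG activations
widths = [C, 16, 32, 2, 1]             # placeholder widths; four 1x1 layers
weights = [rng.normal(scale=0.1, size=(a, b))
           for a, b in zip(widths[:-1], widths[1:])]
biases = [np.zeros(b) for b in widths[1:]]
centre_bias_log = np.zeros((H, W))     # flat prior for this toy example

log_p = readout_log_density(features, weights, biases, centre_bias_log)
print(log_p.shape, np.exp(log_p).sum())
```

Only `weights`, `biases` and, in a fuller version, the blur and centre-bias parameters would be trained. The feature array is never updated, which mirrors the caption's statement that the VGG parameters are held fixed.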