Seeing with Humans: Gaze-Assisted Neural Image Captioning

Yusuke Sugano, Andreas Bulling

TL;DR

This work investigates integrating human gaze into holistic image understanding, specifically image captioning. It analyzes how gaze relates to object and scene recognition and introduces a split-attention LSTM that uses a fixation-aware mechanism to allocate attention to both fixated and non-fixated regions. Empirical results on COCO/SALICON show consistent captioning gains and improved word discovery for small, semantically important objects, underscoring gaze as a complementary signal to machine attention. The study highlights potential for gaze-guided captioning in cluttered images and suggests future work with real, egocentric gaze data.

Abstract

Gaze reflects how humans process visual scenes and is therefore increasingly used in computer vision systems. Previous works demonstrated the potential of gaze for object-centric tasks, such as object localization and recognition, but it remains unclear if gaze can also be beneficial for scene-centric tasks, such as image captioning. We present a new perspective on gaze-assisted image captioning by studying the interplay between human gaze and the attention mechanism of deep neural networks. Using a public large-scale gaze dataset, we first assess the relationship between state-of-the-art object and scene recognition models, bottom-up visual saliency, and human gaze. We then propose a novel split attention model for image captioning. Our model integrates human gaze information into an attention-based long short-term memory architecture, and allows the algorithm to allocate attention selectively to both fixated and non-fixated image regions. Through evaluation on the COCO/SALICON datasets we show that our method improves image captioning performance and that gaze can complement machine attention for semantic scene understanding tasks.
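The split attention idea in the abstract can be illustrated with a small sketch. This is only one plausible reading, not the authors' exact formulation: it assumes each image region carries a gaze value in [0, 1], that two separate scoring functions handle fixated and non-fixated regions, and that their scores are blended by the gaze value before a softmax. The names `split_attention`, `w_fix` and `w_non` are hypothetical and do not come from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def split_attention(features, gaze, hidden, w_fix, w_non):
    """Attention over image regions with separate scoring for fixated
    and non-fixated regions (illustrative assumption, not the paper's code).

    features: list of L region feature vectors, each of dimension D
    gaze:     list of L gaze values in [0, 1] (1 = strongly fixated)
    hidden:   decoder hidden state, dimension D
    w_fix, w_non: hypothetical scoring vectors, dimension D
    """
    scores = []
    for a, g in zip(features, gaze):
        # Element-wise interaction between the region and the decoder state.
        joint = [x * h for x, h in zip(a, hidden)]
        s_fix = dot(w_fix, joint)   # score if the region is fixated
        s_non = dot(w_non, joint)   # score if the region is not fixated
        # Blend the two scores according to how strongly the region was fixated.
        scores.append(g * s_fix + (1.0 - g) * s_non)
    alpha = softmax(scores)
    dim = len(features[0])
    # Context vector: attention-weighted sum of region features.
    context = [
        sum(alpha[i] * features[i][d] for i in range(len(features)))
        for d in range(dim)
    ]
    return alpha, context


random.seed(0)
L, D = 4, 3
features = [[random.random() for _ in range(D)] for _ in range(L)]
gaze = [1.0, 0.0, 0.5, 0.0]
hidden = [random.random() for _ in range(D)]
w_fix = [random.uniform(-1, 1) for _ in range(D)]
w_non = [random.uniform(-1, 1) for _ in range(D)]

alpha, context = split_attention(features, gaze, hidden, w_fix, w_non)
print("attention weights:", alpha)
print("context vector:", context)
```

Because the non-fixated branch has its own scoring vector, regions without gaze can still receive attention, which is the property the abstract emphasizes. In a full captioning model the context vector would be passed to the LSTM cell at each word step.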

Paper Structure

This paper contains 13 sections, 4 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Our method takes gaze-annotated images as input, and uses both human gaze and bottom-up visual features for attention-based captioning.
  • Figure 2: Top-$k$ accuracy for object and scene recognition. The horizontal axis indicates the ratio of the visible area given by thresholding the fixation, saliency and center maps. $k$ is set to the number of labels associated with each image.
  • Figure 3: Comparison of feature importance maps. Mean maps over all corresponding labels are overlaid onto the image with a color coding from blue (lowest importance) to red (highest importance). For better visual comparison, all maps were histogram equalized.
  • Figure 4: Pipeline of the gaze-assisted image captioning. The attention function takes both image and gaze features as input, and the context vector weighted with the attention is given to the LSTM cell for word-by-word captioning.
  • Figure 5: Sample images, the machine attention map at each step as well as the corresponding output words for the baseline and gaze-assisted models. The first example illustrates the case where the proposed model finds small but important objects (kite) in the scene. It also helps to suppress the repetition of object description in cluttered scenes (laptop in the second example). The proposed split attention model can also describe objects without strong fixation, such as snowboard in the third example. See the supplementary material for more examples.
  • ...and 1 more figure
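The evaluation described in the Figure 2 caption (thresholding an importance map to a given visible-area ratio, then scoring top-$k$ accuracy with $k$ equal to the number of labels per image) can be sketched roughly as follows. This is a hedged illustration under stated assumptions: the helper names `visible_mask` and `top_k_hit_rate` are hypothetical, the toy map and scores are made up for demonstration, and reading "top-$k$ accuracy" as the fraction of ground-truth labels found among the $k$ highest-scoring predictions is an assumption, not something the caption spells out.

```python
def visible_mask(importance, ratio):
    """Binary mask keeping roughly the top `ratio` fraction of pixels.

    importance: 2D list of floats (e.g. a fixation, saliency or center map)
    ratio:      fraction of the image area that should stay visible
    Ties at the threshold value may keep slightly more pixels than requested.
    """
    flat = sorted((v for row in importance for v in row), reverse=True)
    n_keep = max(1, int(ratio * len(flat)))
    threshold = flat[n_keep - 1]
    return [[1 if v >= threshold else 0 for v in row] for row in importance]


def top_k_hit_rate(scores, true_labels):
    """Fraction of true labels among the k best predictions, k = len(true_labels).

    scores:      dict mapping label -> classifier score (assumed interface)
    true_labels: set of ground-truth labels for the image
    """
    k = len(true_labels)
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return len(set(ranked) & set(true_labels)) / k


# Toy 2x2 importance map: keep the most important half of the pixels.
toy_map = [[0.1, 0.9],
           [0.4, 0.2]]
mask = visible_mask(toy_map, 0.5)

# Toy classifier scores for an image with two ground-truth labels.
toy_scores = {"kite": 0.7, "person": 0.6, "dog": 0.2}
hit_rate = top_k_hit_rate(toy_scores, {"kite", "dog"})

print(mask)
print(hit_rate)
```

In an actual experiment the masked image would be passed to a recognition model before scoring; that step is omitted here because it depends on the specific model used.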