Grounding of Textual Phrases in Images by Reconstruction

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, Bernt Schiele

TL;DR

This work tackles grounding free-form textual phrases in images by learning to localize phrases through reconstructing them from attended image regions. The proposed GroundeR framework performs grounding via soft attention over region proposals and a reconstruction path that generates the input phrase from the attended features, enabling unsupervised, semi-supervised, and fully supervised training. Empirical results on Flickr 30k Entities and ReferItGame show that GroundeR matches or surpasses state-of-the-art under all supervision levels, with semi-supervised learning particularly benefiting from combining reconstruction with limited labeled data. The approach demonstrates strong practical potential for scalable phrase grounding and motivates future exploration of relational reasoning and joint phrase modeling.

Abstract

Grounding (i.e., localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications for human-computer interaction and image-text reference resolution. Few datasets provide the ground-truth spatial localization of phrases, so it is desirable to learn from data with little or no grounding supervision. We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly. During training, our approach encodes the phrase using a recurrent network language model and then learns to attend to the relevant image region in order to reconstruct the input phrase. At test time, the correct attention, i.e., the grounding, is evaluated. If grounding supervision is available, it can be directly applied via a loss over the attention mechanism. We demonstrate the effectiveness of our approach on the Flickr 30k Entities and ReferItGame datasets with different levels of supervision, ranging from no supervision through partial supervision to full supervision. Our supervised variant improves by a large margin over the state-of-the-art on both datasets.
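The grounding step described in the abstract can be illustrated with a minimal sketch: soft attention over candidate boxes, an attended feature that a reconstruction decoder would consume, the test-time grounding as the most-attended box, and the optional supervised loss over the attention. Everything concrete here is an assumption made for illustration: the dot-product scoring, the toy feature vectors, and the function names `attend`, `ground`, and `attention_loss` are hypothetical, and the recurrent phrase encoder and reconstruction decoder are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(phrase_vec, region_feats):
    """Soft attention over region proposals.

    Scoring by dot product is an assumption for this sketch; the source
    only says that attention is computed over boxes given the phrase.
    """
    scores = [sum(p * v for p, v in zip(phrase_vec, feat))
              for feat in region_feats]
    alpha = softmax(scores)
    dim = len(region_feats[0])
    # Attention-weighted sum of region features: the input a
    # reconstruction decoder would use to regenerate the phrase.
    attended = [sum(a * feat[d] for a, feat in zip(alpha, region_feats))
                for d in range(dim)]
    return alpha, attended


def ground(alpha):
    """Test-time grounding: index of the box with the highest attention."""
    return max(range(len(alpha)), key=lambda i: alpha[i])


def attention_loss(alpha, gt_index):
    """Supervised loss over the attention: negative log-likelihood of the
    annotated box (usable only when grounding supervision exists)."""
    return -math.log(alpha[gt_index])


# Toy inputs (hypothetical): one encoded phrase and three region features.
phrase_vec = [1.0, 0.0]
region_feats = [[0.1, 0.9], [2.0, 0.2], [0.5, 0.5]]

alpha, attended = attend(phrase_vec, region_feats)
predicted_box = ground(alpha)
loss = attention_loss(alpha, gt_index=1)
```

In the unsupervised setting only a reconstruction loss on the regenerated phrase would drive `alpha`, so the attention stays latent; with box annotations, a term like `attention_loss` could be added to that objective.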

Paper Structure

This paper contains 13 sections, 11 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: (a) Without bounding box annotations at training time, our approach GroundeR can ground free-form natural language phrases in images. (b) During training, our latent attention approach reconstructs phrases by learning to attend to the correct box. (c) At test time, the attention model infers the grounding for each phrase. For semi-supervised and fully supervised variants see Fig. 2.
  • Figure 2: Our model learns grounding of textual phrases in images with (a) no, (b) little, or (c) full supervision of localization, through a grounding part and a reconstruction part. During training, the model distributes its attention to a single or several boxes, and learns to reconstruct the input phrase based on the boxes it attends to. At test time, only the grounding part is used.
  • Figure 3: Qualitative results on the test set of Flickr 30k Entities. Top: GroundeR (VGG-DET) unsupervised, bottom: GroundeR (VGG-DET) supervised.
  • Figure 4: Qualitative results on the test set of ReferItGame: GroundeR (VGG+SPAT) supervised. Green: ground-truth box, red: predicted box.