Learning to generalize to new compositions in image understanding

Yuval Atzmon, Jonathan Berant, Vahid Kazemi, Amir Globerson, Gal Chechik

TL;DR

Problem: image captioning models generalize poorly to novel compositions of known objects and relations. Approach: represent each description as a subject-relation-object (SRO) triplet and evaluate on a compositional split, in which test images contain only combinations of entities not seen together in training; train a structured prediction model, a structured SVM (SSVM) on top of a CNN, and compare it with Show, Attend and Tell (SA&T). Findings: the SSVM model substantially outperforms SA&T on unseen compositions (roughly seven times the accuracy) and is competitive on the standard split; SA&T shows a large generalization gap. Significance: the work provides a concrete benchmark for compositional generalization in vision-language tasks and argues for models that explicitly capture linguistic and visual structure rather than relying solely on text statistics.

Abstract

Recurrent neural networks have recently been used for learning to describe images using natural language. However, it has been observed that these models generalize poorly to scenes that were not observed during training, possibly depending too strongly on the statistics of the text in the training data. Here we propose to describe images using short structured representations, aiming to capture the crux of a description. These structured representations allow us to tease out and evaluate separately two types of generalization: standard generalization to new images with similar scenes, and generalization to new combinations of known entities. We compare two learning approaches on the MS-COCO dataset: a state-of-the-art recurrent network based on an LSTM (Show, Attend and Tell), and a simple structured prediction model on top of a deep network. We find that the structured model generalizes to new compositions substantially better than the LSTM, achieving roughly 7 times the accuracy when predicting structured representations. By providing a concrete method to quantify generalization for unseen combinations, we argue that structured representations and compositional splits are a useful benchmark for image captioning, and advocate compositional models that capture linguistic and visual structure.

Paper Structure

This paper contains 14 sections, 2 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Our motivating task: Learning to generalize to new compositions of entities in images, reflected in their descriptions. Each image is represented with a subject-relation-object (SRO) tuple. In a compositional split, testing is performed over novel compositions of entities observed during training; that is, all images matching a given SRO are assigned either to training or to testing.
  • Figure 2: Comparing SA&T with SSVM/conv. (a) MS-COCO split. (b) Compositional split. SA&T overfits more strongly than SSVM on the compositional split. Error bars denote the standard error of the mean (SEM) across five CV folds.
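
The compositional split described in the Figure 1 caption can be illustrated with a short sketch. This is not the authors' code: the function names, the `test_fraction` parameter, the random assignment of triplets, and the toy data are all assumptions made for illustration. The only property taken from the caption is that every image sharing an SRO triplet lands on the same side of the split, so each test triplet is a composition never seen in training.

```python
import random
from collections import defaultdict


def compositional_split(image_sros, test_fraction=0.2, seed=0):
    """Split image ids so that no SRO triplet appears in both train and test.

    image_sros: dict mapping image id -> (subject, relation, object).
    test_fraction: fraction of distinct SRO triplets (not images) held out.
    Both parameters are hypothetical; the paper's exact procedure may differ.
    """
    # Group image ids by their SRO triplet.
    groups = defaultdict(list)
    for image_id, sro in image_sros.items():
        groups[sro].append(image_id)

    # Shuffle the distinct triplets reproducibly, then hold out a fraction.
    sros = sorted(groups)
    random.Random(seed).shuffle(sros)
    n_test = max(1, int(len(sros) * test_fraction))
    test_sros, train_sros = sros[:n_test], sros[n_test:]

    train = [i for s in train_sros for i in groups[s]]
    test = [i for s in test_sros for i in groups[s]]
    return train, test


def keep_known_entities(image_sros, train, test):
    """Keep test images whose subject, relation and object each occur in training."""
    subjects = {image_sros[i][0] for i in train}
    relations = {image_sros[i][1] for i in train}
    objects = {image_sros[i][2] for i in train}
    return [
        i for i in test
        if image_sros[i][0] in subjects
        and image_sros[i][1] in relations
        and image_sros[i][2] in objects
    ]


# Toy data (invented for this example, not taken from MS-COCO).
toy = {
    "img1": ("man", "riding", "horse"),
    "img2": ("man", "riding", "horse"),
    "img3": ("woman", "riding", "bike"),
    "img4": ("man", "holding", "bike"),
    "img5": ("woman", "holding", "horse"),
}
train_ids, test_ids = compositional_split(toy, test_fraction=0.25, seed=0)
test_ids = keep_known_entities(toy, train_ids, test_ids)
```

Grouping by triplet before splitting is what distinguishes this from a standard random split over images, where the same triplet can appear on both sides. The second helper reflects the caption's requirement that the entities themselves are observed during training; it simply drops test images that fail that check.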