Visual Madlibs: Fill in the blank Image Generation and Question Answering

Licheng Yu, Eunbyung Park, Alexander C. Berg, Tamara L. Berg

TL;DR

Visual Madlibs addresses the need for targeted image-language descriptions by collecting fill-in-the-blank prompts to elicit focused details about people, objects, and scene context. The authors build a large dataset (360k responses for 10,738 COCO images) across 12 prompt types and evaluate targeted generation and multiple-choice image QA using joint-embedding (CCA/nCCA) and CNN-LSTM methods. Analyses show Madlibs descriptions are more detailed and cover content beyond generic captions, and outperform COCO-based descriptions in the proposed QA task, with bounding-box cues aiding attribute questions. The work provides new data, tasks, and baselines to drive focused image-language understanding and will release resources publicly.

Abstract

In this paper, we introduce a new dataset consisting of 360,001 focused natural language descriptions for 10,738 images. This dataset, the Visual Madlibs dataset, is collected using automatically produced fill-in-the-blank templates designed to gather targeted descriptions about: people and objects, their appearances, activities, and interactions, as well as inferences about the general scene or its broader context. We provide several analyses of the Visual Madlibs dataset and demonstrate its applicability to two new description generation tasks: focused description generation, and multiple-choice question-answering for images. Experiments using joint-embedding and deep learning methods show promising results on these tasks.
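As a rough illustration of the multiple-choice question-answering task with a joint embedding, the sketch below scores each candidate answer against an image by cosine similarity and picks the best match. It assumes the image and the answers have already been projected into a shared space (for example by CCA). The vectors, the candidate phrases, and the function names `cosine` and `pick_answer` are hypothetical and are not taken from the paper's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_answer(image_vec, choice_vecs):
    """Return (index of the best-scoring choice, list of all scores).

    Assumes image_vec and every entry of choice_vecs already live in the
    same joint embedding space; learning that projection is out of scope here.
    """
    scores = [cosine(image_vec, c) for c in choice_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy, hand-made embeddings (hypothetical values, for illustration only).
image_vec = [0.9, 0.1, 0.0]
choices = ["playing frisbee", "eating pizza", "riding a horse"]
choice_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

best, scores = pick_answer(image_vec, choice_vecs)
print(choices[best])
```

In a real setup the toy vectors would be replaced by learned projections of image features and of the candidate fill-in-the-blank answers.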

Paper Structure

This paper contains 11 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: An example from the Visual Madlibs Dataset. This dataset collects targeted descriptions for people and objects, denoting their appearances, affordances, activities, and interactions. It also provides descriptions of broader emotional, spatial and temporal context for an image.
  • Figure 2: Madlibs description. The first row corresponds to question types 1-5, the second row to question types 9-11, and the third row to question types 6-8 and question type 12. All question types are listed in Table \ref{table:question}.
  • Figure 3: COCO instance annotation and descriptions for the image of Fig. \ref{fig:frisbee}. We show how we map labeled instances to the mentioned person and object in the sentence.
  • Figure 4: The first row shows the top-5 most frequent phrase templates for the image's future, the object's attribute, the object's affordance, and the person's activity. The second row shows histograms of similarity between answers.
  • Figure 5: Templates used for parsing a person's attributes, activity, and interaction with an object, and an object's attribute. The percentages below compare how frequently these templates are used for description in Madlibs and MSCOCO.
  • ...and 3 more figures