Visual Madlibs: Fill in the blank Image Generation and Question Answering
Licheng Yu, Eunbyung Park, Alexander C. Berg, Tamara L. Berg
TL;DR
Visual Madlibs addresses the need for targeted image-language descriptions by using fill-in-the-blank prompts to elicit focused details about people, objects, and scene context. The authors build a large dataset (about 360k responses for 10,738 COCO images) across 12 prompt types and evaluate targeted generation and multiple-choice image QA using joint-embedding (CCA/nCCA) and CNN-LSTM methods. Analyses show that Madlibs descriptions are more detailed than generic captions and cover content beyond them. They also outperform COCO-based descriptions on the proposed QA task, and bounding-box cues help on attribute questions. The work provides new data, tasks, and baselines for focused image-language understanding, and the authors will release the resources publicly.
Abstract
In this paper, we introduce a new dataset consisting of 360,001 focused natural language descriptions for 10,738 images. This dataset, the Visual Madlibs dataset, is collected using automatically produced fill-in-the-blank templates designed to gather targeted descriptions about: people and objects, their appearances, activities, and interactions, as well as inferences about the general scene or its broader context. We provide several analyses of the Visual Madlibs dataset and demonstrate its applicability to two new description generation tasks: focused description generation, and multiple-choice question-answering for images. Experiments using joint-embedding and deep learning methods show promising results on these tasks.
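As a rough illustration of the multiple-choice question-answering task, the sketch below picks the candidate answer whose embedding is closest to the image embedding under cosine similarity. It assumes the image and the candidate answers have already been projected into a shared space (for example by a CCA-style model); the function names, the toy vectors, and the choice of cosine similarity are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def cosine_scores(image_vec, candidate_vecs):
    """Cosine similarity between one image vector and each candidate row."""
    image_unit = image_vec / np.linalg.norm(image_vec)
    cand_units = candidate_vecs / np.linalg.norm(
        candidate_vecs, axis=1, keepdims=True
    )
    return cand_units @ image_unit


def answer_multiple_choice(image_vec, candidate_vecs):
    """Return the index of the candidate most similar to the image."""
    return int(np.argmax(cosine_scores(image_vec, candidate_vecs)))


# Toy vectors standing in for embeddings in a hypothetical joint space.
image_vec = np.array([1.0, 0.0, 1.0])
candidates = np.array([
    [0.0, 1.0, 0.0],    # orthogonal to the image vector
    [2.0, 0.1, 1.9],    # nearly parallel to the image vector
    [-1.0, 0.0, -1.0],  # opposite direction
    [0.5, 0.5, 0.0],    # partially aligned
])

choice = answer_multiple_choice(image_vec, candidates)
print("selected candidate index:", choice)
```

In this toy setup the second candidate (index 1) is selected because it points in almost the same direction as the image vector. A real system would replace the hand-written vectors with learned image and text features.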
