Learning Visually Grounded Sentence Representations
Douwe Kiela, Alexis Conneau, Allan Jabri, Maximilian Nickel
TL;DR
The paper studies universal sentence representations grounded in visual data. It introduces three grounding variants, Cap2Img, Cap2Cap, and Cap2Both, and combines them with text-only representations to form GroundSent. Empirical results show improved COCO image-caption retrieval and consistent gains over text-only baselines on NLP transfer tasks; analyses attribute these gains to the grounding signal itself rather than to additional data or parameters. Grounded word embeddings also improve results on standard lexical similarity benchmarks, suggesting that grounding benefits semantics more broadly. The approach offers a practical route to richer multimodal sentence representations.
Abstract
We introduce a variety of models, trained on a supervised image captioning corpus to predict the image features for a given caption, to perform sentence representation grounding. We train a grounded sentence encoder that achieves good performance on COCO caption and image retrieval and subsequently show that this encoder can successfully be transferred to various NLP tasks, with improved performance over text-only models. Lastly, we analyze the contribution of grounding, and show that word embeddings learned by this system outperform non-grounded ones.
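To make "predict the image features for a given caption" concrete, here is a minimal sketch of one common way to train such a mapping: a max-margin ranking loss that pulls each projected caption vector toward its own image vector and away from the other images in the batch. The abstract does not specify the objective, so the choice of loss, the cosine similarity, the margin value, and the names `cosine` and `ranking_loss` are all illustrative assumptions, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranking_loss(caption_vecs, image_vecs, margin=0.1):
    """Max-margin ranking loss over a batch of aligned (caption, image) pairs.

    caption_vecs[i] is assumed to be the caption encoding already projected
    into the image feature space; image_vecs[i] is its paired image feature.
    The margin value is a hypothetical default, not taken from the paper.
    """
    n = len(caption_vecs)
    loss = 0.0
    for i in range(n):
        positive = cosine(caption_vecs[i], image_vecs[i])
        for j in range(n):
            if j == i:
                continue
            # Caption i should score higher with its own image than with image j.
            loss += max(0.0, margin - positive + cosine(caption_vecs[i], image_vecs[j]))
            # Image i should score higher with its own caption than with caption j.
            loss += max(0.0, margin - positive + cosine(caption_vecs[j], image_vecs[i]))
    return loss


# Toy batch: two captions and two images in a 2-dimensional feature space.
captions = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
swapped_images = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = ranking_loss(captions, aligned_images)
loss_swapped = ranking_loss(captions, swapped_images)
```

In the toy batch, the correctly aligned pairs incur no loss, while the swapped pairs are penalized. In a real system the caption vectors would come from a trainable sentence encoder and the image vectors from a pretrained vision network, with the loss minimized by gradient descent.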
