Boosting Entity-aware Image Captioning with Multi-modal Knowledge Graph

Wentian Zhao, Yao Hu, Heda Wang, Xinxiao Wu, Jiebo Luo

TL;DR

This work tackles entity-aware image captioning under long-tail named entities by constructing a multi-modal knowledge graph (MMKG) that links visual objects to article entities and encodes fine-grained entity relationships. A cross-modal entity matching module, trained with a Wikipedia-based external knowledge base, connects a text sub-graph and an image sub-graph to form MMKG, which is then integrated into a graph-attention captioning model. Empirical results on GoodNews and NYTimes800k show improvements in standard captioning metrics and higher entity F1, validating the effectiveness of grounding captions in external multi-modal knowledge. The framework highlights the potential of external knowledge to ground image descriptions in concrete events and entities, with future work aimed at richer linguistic features and decoders.
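As a rough illustration of the entity F1 mentioned above, the sketch below scores the named entities in a generated caption against those in the reference caption. The set-based exact-match scoring and the function name `entity_f1` are assumptions made for illustration; the summary does not state how the paper extracts or matches entities.

```python
def entity_f1(predicted, reference):
    """Return (precision, recall, F1) over two collections of entity strings.

    Assumption: entities are compared as exact, case-sensitive strings and
    duplicates are ignored. The paper's own matching rule may differ.
    """
    pred, ref = set(predicted), set(reference)
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    hits = len(pred & ref)
    if hits == 0:
        return 0.0, 0.0, 0.0
    precision = hits / len(pred)
    recall = hits / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical entities: the caption names two of the three reference entities.
precision, recall, f1 = entity_f1(
    ["Jane Doe", "City Hall"],
    ["Jane Doe", "City Hall", "Springfield"],
)
```

In this toy case every predicted entity is correct, so precision is 1, while recall is 2/3 because one reference entity is missing.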

Abstract

Entity-aware image captioning aims to describe named entities and events related to the image by utilizing the background knowledge in the associated article. This task remains challenging as it is difficult to learn the association between named entities and visual cues due to the long-tail distribution of named entities. Furthermore, the complexity of the article brings difficulty in extracting fine-grained relationships between entities to generate informative event descriptions about the image. To tackle these challenges, we propose a novel approach that constructs a multi-modal knowledge graph to associate the visual objects with named entities and capture the relationship between entities simultaneously with the help of external knowledge collected from the web. Specifically, we build a text sub-graph by extracting named entities and their relationships from the article, and build an image sub-graph by detecting the objects in the image. To connect these two sub-graphs, we propose a cross-modal entity matching module trained using a knowledge base that contains Wikipedia entries and the corresponding images. Finally, the multi-modal knowledge graph is integrated into the captioning model via a graph attention mechanism. Extensive experiments on both GoodNews and NYTimes800k datasets demonstrate the effectiveness of our method.
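The graph construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Graph` class, the `build_mmkg` function, the `"matches"` relation name, the toy embeddings and the 0.8 threshold are all hypothetical, and a fixed cosine similarity stands in for the trained cross-modal entity matching module.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # node id -> embedding (list of floats)
    edges: list = field(default_factory=list)  # (source id, relation, target id)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_mmkg(text_graph, image_graph, threshold=0.8):
    """Merge the two sub-graphs and link each object to its best-matching entity.

    Assumption: object and entity embeddings already live in a shared space.
    In the paper this alignment is learned from an external knowledge base.
    """
    merged = Graph(
        nodes={**text_graph.nodes, **image_graph.nodes},
        edges=list(text_graph.edges) + list(image_graph.edges),
    )
    for obj_id, obj_vec in image_graph.nodes.items():
        best_id, best_score = None, threshold
        for ent_id, ent_vec in text_graph.nodes.items():
            score = cosine(obj_vec, ent_vec)
            if score >= best_score:
                best_id, best_score = ent_id, score
        if best_id is not None:
            merged.edges.append((obj_id, "matches", best_id))
    return merged


# Toy text sub-graph: two entities and one relation extracted from an article.
text_graph = Graph(
    nodes={"ent:Jane Doe": [0.9, 0.1, 0.0], "ent:City Hall": [0.0, 0.2, 0.9]},
    edges=[("ent:Jane Doe", "speaks_at", "ent:City Hall")],
)
# Toy image sub-graph: two detected objects with made-up embeddings.
image_graph = Graph(
    nodes={"obj:person_0": [0.8, 0.2, 0.1], "obj:tie_0": [0.1, 0.9, 0.1]},
)
mmkg = build_mmkg(text_graph, image_graph)
```

Here only the detected person is similar enough to an article entity to receive a cross-modal edge; the tie stays unlinked. A captioning model would then attend over the merged graph, which this sketch does not cover.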

Paper Structure

This paper contains 18 sections, 2 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: An example of entity-aware image captioning. The named entities of type "PERSON" are marked in red.
  • Figure 2: The framework of our proposed method. The left part shows the external multi-modal knowledge base containing named entities and their corresponding images, which are used to train the cross-modal entity matching module. The middle part shows the generation process of the multi-modal knowledge graphs. An image sub-graph and a text sub-graph are extracted from the input image and the article text, respectively. The cross-modal entity matching module connects the related entities in the two sub-graphs to construct the multi-modal knowledge graph. The right part shows the captioning model, which encodes the image, the article and the multi-modal knowledge graph to generate an entity-aware caption.
  • Figure 3: An example of the constructed multi-modal knowledge graph, which consists of an image sub-graph (the left part of the box) and a text sub-graph (the right part of the box).
  • Figure 4: Qualitative results on the GoodNews dataset ((a), (b)) and the NYTimes800k dataset ((c), (d)). "ground-truth", "w/o graph" and "ours" denote the ground-truth caption, the caption generated without using the multi-modal knowledge graph, and the caption generated by our method, respectively. The named entities in the captions are colored and underlined. Due to space limitations, the rightmost column shows only part of each constructed multi-modal knowledge graph.