MMCOMET: A Large-Scale Multimodal Commonsense Knowledge Graph for Contextual Reasoning

Eileen Wang, Hiba Arnaout, Dhita Pratama, Shuo Yang, Dangyang Liu, Jie Yang, Josiah Poon, Jeff Pan, Caren Han

TL;DR

MMCOMET is the first multimodal commonsense knowledge graph (MMKG) to integrate physical, social, and eventive knowledge, establishing a new foundation for multimodal commonsense reasoning and narrative generation.

Abstract

We present MMCOMET, the first multimodal commonsense knowledge graph (MMKG) that integrates physical, social, and eventive knowledge. MMCOMET extends the ATOMIC2020 knowledge graph with a visual dimension through an efficient image retrieval process, yielding over 900K multimodal triples. This new resource addresses a major limitation of existing MMKGs: their limited support for complex reasoning tasks such as image captioning and storytelling. Through a standard visual storytelling experiment, we show that our holistic approach enables the generation of richer, more coherent, and more contextually grounded stories than those produced using text-only knowledge. This resource establishes a new foundation for multimodal commonsense reasoning and narrative generation.

Paper Structure (31 sections, 1 equation, 2 figures, 8 tables)

Figures (2)

  • Figure 1: An example of automated visual storytelling. Baseline: "Family members enjoyed leisurely moments together. Grandpa shared memories during the trip." Ours: "Family spent the day relaxing in the boat, enjoying beer. Grandpa took the grandson on his lap as he drove, with the grandson's parents watching nearby."
  • Figure 2: A small subset of MMCOMET (top) and the Construction Pipeline (bottom).