MMCOMET: A Large-Scale Multimodal Commonsense Knowledge Graph for Contextual Reasoning

Eileen Wang; Hiba Arnaout; Dhita Pratama; Shuo Yang; Dangyang Liu; Jie Yang; Josiah Poon; Jeff Pan; Caren Han

TL;DR

MMCOMET is the first multimodal commonsense knowledge graph (MMKG) to integrate physical, social, and eventive knowledge, establishing a new foundation for multimodal commonsense reasoning and narrative generation.

Abstract

We present MMCOMET, the first multimodal commonsense knowledge graph (MMKG) that integrates physical, social, and eventive knowledge. MMCOMET extends the ATOMIC2020 knowledge graph with a visual dimension through an efficient image retrieval process, resulting in over 900K multimodal triples. This new resource addresses a major limitation of existing MMKGs: their limited support for complex reasoning tasks such as image captioning and storytelling. Through a standard visual storytelling experiment, we show that our holistic approach enables the generation of richer, more coherent, and more contextually grounded stories than those produced using text-only knowledge. This resource establishes a new foundation for multimodal commonsense reasoning and narrative generation.

Paper Structure (31 sections, 1 equation, 2 figures, 8 tables)

This paper contains 31 sections, 1 equation, 2 figures, 8 tables.

Introduction
Related Work
Text-only Commonsense Knowledge Graphs
Multimodal Knowledge Graphs
Positioning of MMCOMET
Method
Base Commonsense Topology
Hybrid Visual Alignment Pipeline
Embedding-based Similarity Matching
Concreteness-aware Web Retrieval
Post-processing and Image Selection
Experimental Setup
Datasets
Implementation Details
Intrinsic Evaluation: Knowledge Quality
...and 16 more sections

Figures (2)

Figure 1: An example of automated visual storytelling. Baseline: "Family members enjoyed leisurely moments together. Grandpa shared memories during the trip." Ours: "Family spent the day relaxing in the boat, enjoying beer. Grandpa took the grandson on his lap as he drove, with the grandson's parents watching nearby."
Figure 2: A small subset of MMCOMET (top) and the Construction Pipeline (bottom).
