M^2ConceptBase: A Fine-Grained Aligned Concept-Centric Multimodal Knowledge Base

Zhiwei Zha, Jiaan Wang, Zhixu Li, Xiangru Zhu, Wei Song, Yanghua Xiao

TL;DR

This work addresses the scarcity of fine-grained visual concept knowledge by introducing M^2ConceptBase, the first concept-centric multimodal knowledge base (MMKB). It presents a three-stage construction framework (candidate concept mining, context-aware multimodal symbol grounding, and description completion) that leverages large-scale image-text data and encyclopedia sources to produce 151,776 concepts linked to 951,089 images and detailed concept descriptions. The authors report high cross-modal alignment quality (over 95% for both concept-description and concept-image alignment) and show practical benefits: improved OK-VQA performance and enhanced fine-grained concept understanding in retrieval-augmented multimodal large language models (MLLMs). Overall, M^2ConceptBase provides rich, context-aware concept alignments and supports retrieval-augmented reasoning for vision-language tasks and MLLMs.

Abstract

Multimodal knowledge bases (MMKBs) provide cross-modal aligned knowledge crucial for multimodal tasks. However, the images in existing MMKBs are generally collected for entities in encyclopedia knowledge graphs. Consequently, these MMKBs lack detailed groundings of visual semantics with linguistic concepts, which are essential for the visual concept cognition ability of multimodal models. Addressing this gap, we introduce M^2ConceptBase, the first concept-centric MMKB. M^2ConceptBase models concepts as nodes with associated images and detailed textual descriptions. We propose a context-aware multimodal symbol grounding approach to align concept-image and concept-description pairs using context information from image-text datasets. Comprising 951K images and 152K concepts, M^2ConceptBase links each concept to an average of 6.27 images and a single description, ensuring comprehensive visual and textual semantics. Human studies confirm more than 95% alignment accuracy, underscoring its quality. Additionally, our experiments demonstrate that M^2ConceptBase significantly enhances VQA model performance on the OK-VQA task. M^2ConceptBase also substantially improves the fine-grained concept understanding capabilities of multimodal large language models through retrieval augmentation in two concept-related tasks, highlighting its value.
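
As a quick consistency check, the reported average of 6.27 images per concept can be reproduced from the exact counts given in the TL;DR (151,776 concepts and 951,089 images). The short sketch below assumes the average is taken over all concepts; the variable names are illustrative only.

```python
# Exact counts as stated in the TL;DR.
num_concepts = 151_776
num_images = 951_089

# Assumption: the average is computed over every concept in the knowledge base.
avg_images_per_concept = num_images / num_concepts

print(f"{avg_images_per_concept:.2f} images per concept")
```

Rounded to two decimals, the ratio agrees with the figure quoted in the abstract.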

Paper Structure

This paper contains 15 sections, 10 equations, 6 figures, and 9 tables.

Figures (6)

  • Figure 1: (a) Examples of sub-optimal fine-grained concept understanding in GPT-4V and MiniGPT-4, highlighting the need for concept-centric MMKBs. (b) Entity-centric MMKBs. (c) Concept-centric MMKBs.
  • Figure 2: Our framework for large-scale concept-centric multimodal knowledge base construction. In step 1, we mine candidate concepts from large-scale image-text pairs by tokenizing their textual descriptions and filtering the tokenized results by rule-based strategies. In step 2, we ground each candidate concept with concept-relevant images and detailed concept descriptions. In step 3, we generate concept descriptions for those concepts that failed to be grounded in step 2.
  • Figure 3: Example concept nodes in M^2ConceptBase.
  • Figure 4: Distribution of the number of concepts associated with different numbers (1 to 20) of images in M^2ConceptBase.
  • Figure 5: Illustration of the OK-VQA method equipped with M^2ConceptBase and an LLM.
  • ...and 1 more figure
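
Step 1 in the Figure 2 caption (tokenize the text of image-text pairs, then filter tokens with rule-based strategies) can be illustrated with a minimal sketch. This is not the authors' pipeline: the regex tokenizer, the stopword list, and the `min_count` and `min_len` thresholds are all hypothetical stand-ins chosen for illustration, since the summary does not specify the actual tokenizer or rules.

```python
import re
from collections import Counter

# Hypothetical stopword list; the paper's actual filtering rules are not given here.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "with", "and", "is", "at"}


def mine_candidate_concepts(captions, min_count=2, min_len=3):
    """Return tokens that pass simple rule-based filters (illustrative only)."""
    counts = Counter()
    for caption in captions:
        # Stand-in tokenizer: lowercase alphabetic runs.
        tokens = re.findall(r"[a-z]+", caption.lower())
        # Count each token at most once per caption (document frequency).
        counts.update(set(tokens))
    return sorted(
        tok
        for tok, n in counts.items()
        if n >= min_count and len(tok) >= min_len and tok not in STOPWORDS
    )


# Toy captions, invented for this example.
captions = [
    "A golden retriever playing in the park",
    "The park bench under a tree",
    "A retriever with a red ball",
]
candidates = mine_candidate_concepts(captions)
print(candidates)
```

In this toy run, only tokens that appear in at least two captions and survive the stopword and length filters remain as candidates. The surviving candidates would then be passed to the grounding step (step 2), which is not sketched here.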