Scene-Driven Multimodal Knowledge Graph Construction for Embodied AI

Song Yaoxian, Sun Penglei, Liu Haoyu, Li Zhixu, Song Wei, Xiao Yanghua, Zhou Xiaofang

TL;DR

This work addresses the need for reliable, scene-specific knowledge in embodied AI by introducing Scene-MMKG, a multimodal knowledge graph designed for a particular scene and integrated with embodied tasks. It couples prompt-based schema design, concept mining, ontology expansion, and two-source knowledge population (general and scene-oriented) with quality control to produce a compact yet rich knowledge base, instantiated as ManipMob-MMKG for indoor mobility and manipulation tasks. A scene knowledge retrieval and encoding pipeline injects this knowledge into Visual Language Navigation (VLN) and 3D Object Language Grounding models, yielding improvements over baselines that use general or domain-specific knowledge, with multimodal data and denoising proving particularly beneficial. The approach demonstrates data-efficient construction, scalable granularity, and practical gains in downstream tasks, highlighting the potential of scene-driven knowledge injection to enhance robustness and interpretability in embodied AI applications.

Abstract

Embodied AI is one of the most active research areas in artificial intelligence and robotics, and it can effectively improve the intelligence of real-world agents (i.e., robots) that serve human beings. Scene knowledge is important for an agent to understand its surroundings and make correct decisions in a varied open world. Currently, a knowledge base for embodied tasks is missing, and most existing work uses general knowledge bases or pre-trained models to enhance the intelligence of an agent. Conventional knowledge bases are sparse, insufficient in capacity, and costly in data collection. Pre-trained models suffer from uncertain knowledge and are hard to maintain. To overcome these challenges, we propose a scene-driven multimodal knowledge graph (Scene-MMKG) construction method that combines conventional knowledge engineering with large language models. A unified scene knowledge injection framework is introduced for knowledge representation. To evaluate the advantages of our proposed method, we instantiate Scene-MMKG for typical indoor robotic functionalities (manipulation and mobility) as ManipMob-MMKG. A comparison of characteristics indicates that ManipMob-MMKG is broadly superior in data-collection efficiency and knowledge quality. Experimental results on typical embodied tasks show that knowledge-enhanced methods using ManipMob-MMKG clearly improve performance without a complex redesign of model structures. Our project can be found at https://sites.google.com/view/manipmob-mmkg
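
As a rough illustration of what retrieving scene knowledge for an instruction can look like, the following is a minimal sketch rather than the authors' implementation. The `Triple` type, the relation names, the `source` label, and the toy facts are all hypothetical and are not taken from ManipMob-MMKG; the keyword matching stands in for whatever retrieval module the paper actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str
    source: str  # hypothetical label: "general" or "scene"


# Toy, hand-written facts for illustration only (not from ManipMob-MMKG).
TOY_KG = [
    Triple("mug", "located_in", "kitchen", "scene"),
    Triple("mug", "has_affordance", "graspable", "scene"),
    Triple("mug", "is_a", "container", "general"),
    Triple("sofa", "located_in", "living room", "scene"),
    Triple("towel", "located_in", "bathroom", "scene"),
]


def retrieve(instruction, kg, relation=None):
    """Return triples whose head concept appears as a word in the instruction."""
    cleaned = instruction.lower().replace(".", " ").replace(",", " ")
    tokens = set(cleaned.split())
    return [
        t for t in kg
        if t.head in tokens and (relation is None or t.relation == relation)
    ]


def where(instruction, kg):
    """Answer a 'where' question: candidate locations for mentioned objects."""
    return sorted({t.tail for t in retrieve(instruction, kg, "located_in")})


if __name__ == "__main__":
    print(where("Bring me the mug.", TOY_KG))
    for triple in retrieve("Bring me the mug.", TOY_KG):
        print(triple)
```

In this sketch the retrieved triples would be passed on to an encoder and fused with the agent's visual and language features; that step is omitted because its details are not given in this summary.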

Paper Structure (35 sections, 11 equations, 6 figures, 3 tables, 1 algorithm)

This paper contains 35 sections, 11 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of Scene-MMKG construction and of embodied tasks that use Scene-MMKG. Given the instruction and multimodal perception, the agent is required to retrieve knowledge and answer the "where" and "which" questions.
  • Figure 2: Given the scene profiles, we design a prompt-based schema based on LLMs and then populate multimodal knowledge guided by the schema to construct our Scene-MMKG. Scene-MMKG is refined by hierarchicalization and aggregation for attributes to resolve long-tail problems.
  • Figure 3: The left panel gives an overview of the scene-driven knowledge enhancement model. The right panel shows the details of the scene knowledge retrieval module.
  • Figure 4: Visualization of VLN results of CKR-ManipMob-MMKG (left) and CKR-ConceptNet (right).
  • Figure 5: Visualization of 3D object language grounding cases and the scene knowledge retrieval process.
  • ...and 1 more figure