MoK-RAG: Mixture of Knowledge Paths Enhanced Retrieval-Augmented Generation for Embodied AI Environments

Zhengsheng Guo, Linwei Zheng, Xinyang Chen, Xuefeng Bai, Kehai Chen, Min Zhang

TL;DR

This work addresses the limitation of single-corpus Retrieval-Augmented Generation by introducing MoK-RAG, a Mixture of Knowledge Paths framework that partitions an LLM corpus into multiple specialized knowledge paths for concurrent retrieval. It extends the framework to Embodied AI 3D environment generation with MoK-RAG3D, incorporating a Splitting Module, a Constraint Module, and a dedicated Layout Module to produce cohesive, diverse scenes via a hierarchical knowledge tree and explicit spatial relations. Empirical results show reduced Reply Missing, improved asset selection and layout coherence, and competitive scene quality compared to HOLODECK, with automated and human evaluations validating effectiveness in generating varied 3D environments. The work demonstrates the practical value of multi-path knowledge retrieval for Embodied AI and provides a foundation for automated, scalable 3D scene generation and evaluation, although testing on real robot hardware remains a current limitation.

Abstract

While human cognition inherently retrieves information from diverse and specialized knowledge sources during decision-making processes, current Retrieval-Augmented Generation (RAG) systems typically operate through single-source knowledge retrieval, leading to a cognitive-algorithmic discrepancy. To bridge this gap, we introduce MoK-RAG, a novel multi-source RAG framework that implements a mixture of knowledge paths enhanced retrieval mechanism through functional partitioning of a large language model (LLM) corpus into distinct sections, enabling retrieval from multiple specialized knowledge paths. Applied to the generation of 3D simulated environments, our proposed MoK-RAG3D enhances this paradigm by partitioning 3D assets into distinct sections and organizing them based on a hierarchical knowledge tree structure. Different from previous methods that only use manual evaluation, we pioneered the introduction of automated evaluation methods for 3D scenes. Both automatic and human evaluations in our experiments demonstrate that MoK-RAG3D can assist Embodied AI agents in generating diverse scenes.
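
The multi-path retrieval idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the path names, the toy corpus, the token-overlap scoring, and the function names (`retrieve_from_path`, `multi_path_retrieve`) are all assumptions made for illustration. It only shows the general pattern of partitioning a corpus into separate knowledge paths and querying each one concurrently, so that every path can contribute a result.

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for a real embedding model."""
    return text.lower().split()


def score(query, doc):
    """Toy relevance: number of distinct query tokens found in the document."""
    return len(set(tokenize(query)) & set(tokenize(doc)))


def retrieve_from_path(query, docs, k):
    """Return up to k matching documents from one knowledge path, best first."""
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    return [d for d in ranked[:k] if score(query, d) > 0]


def multi_path_retrieve(query, paths, k=1):
    """Query every knowledge path concurrently and keep the results per path."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(retrieve_from_path, query, docs, k)
            for name, docs in paths.items()
        }
        return {name: fut.result() for name, fut in futures.items()}


# Hypothetical partition of a small asset corpus into three knowledge paths.
paths = {
    "furniture": ["a wooden bed for a bedroom", "an office desk with drawers"],
    "lighting": ["a bedside lamp for a bedroom", "a ceiling light for a kitchen"],
    "decor": ["a framed painting for a living room", "a rug for a bedroom floor"],
}

results = multi_path_retrieve("bedroom", paths, k=1)
for name, hits in results.items():
    print(name, "->", hits)
```

In this sketch a single-corpus top-1 lookup would return only one asset for the query, whereas the per-path lookup returns one candidate from each partition, which is the intuition behind retrieving from several specialized sources at once.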

Paper Structure

This paper contains 30 sections, 5 equations, 12 figures.

Figures (12)

  • Figure 1: A figure showing the difference between human beings and LLM agents. In human cognition, decisions are often made by retrieving information from diverse knowledge sources. However, current Retrieval-Augmented Generation (RAG) systems typically rely on a single knowledge corpus.
  • Figure 2: Overview of MoK-RAG and MoK-RAG3D. MoK-RAG consists of Splitting and Constraint modules for multi-source knowledge retrieval and a Generation Module for response generation. MoK-RAG3D refines the Generation Module of the MoK-RAG framework into a dedicated Layout Module to facilitate scene generation.
  • Figure 3: Occurrence rate of the Reply Missing problem, shown for two aspects: main objects and paired objects.
  • Figure 4: Comparative human evaluation of MoK-RAG3D and HOLODECK across three criteria. The pie charts show the distribution of annotator preferences, showing both the percentage and the actual number of annotations favoring each system.
  • Figure 5: Human evaluation on 52 scene types from MIT Scenes Dataset mitscene with qualitative examples. The two horizontal lines represent the average scores of MoK-RAG3D and HOLODECK on four types of residential scenes (bedroom, living room, bathroom, and kitchen).
  • ...and 7 more figures