MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments

Yang Liu, Xinshuai Song, Kaixuan Jiang, Weixing Chen, Jingzhou Luo, Guanbin Li, Liang Lin

TL;DR

MEIA grounds embodied AI planning in a Multimodal Environment Memory (MEM) that stores both language-based scene descriptions and visual representations, including a floor plan. By coupling MEM with vision-language and large language models, MEIA translates natural language goals into executable action sequences and carries them out in a cafe-style simulator; an accompanying embodied QA dataset evaluates reasoning in dynamic environments. The key contributions are the MEM design (environmental language memory and environmental image memory), the embodied QA dataset, and a zero-shot evaluation across sub-tasks, planning, and the full pipeline, which reports promising performance and shows that memory is critical for producing executable plans. The work targets more reliable and better-grounded robot–human collaboration in unknown environments.

Abstract

With the surge in the development of large language models, embodied intelligence has attracted increasing attention. Nevertheless, prior work on embodied intelligence typically encodes scene or historical memory in a unimodal manner, either visual or linguistic, which complicates the alignment of the model's action planning with embodied control. To overcome this limitation, we introduce the Multimodal Embodied Interactive Agent (MEIA), which translates high-level tasks expressed in natural language into a sequence of executable actions. Specifically, we propose a novel Multimodal Environment Memory (MEM) module that integrates embodied control with large models through a visual-language memory of scenes. This capability enables MEIA to generate executable action plans based on diverse requirements and the robot's capabilities. Furthermore, we construct an embodied question answering dataset based on a dynamic virtual cafe environment with the help of a large language model. In this virtual environment, we conduct several experiments using multiple large models in a zero-shot setting, with carefully designed scenarios for various situations. The experimental results show promising performance of MEIA on various embodied interactive tasks.
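
To make the role of a visual-language memory more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the memory can be reduced to a list of text descriptions plus a table of named floor-plan coordinates. It also assumes that a candidate plan, which in MEIA would come from a large model and is hard-coded here, is filtered against that memory and the robot's capabilities. All names (`EnvironmentMemory`, `observe`, `as_prompt_context`, `ground_plan`, the action strings) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EnvironmentMemory:
    """Hypothetical stand-in for a multimodal environment memory."""

    # Language memory: free-text scene descriptions.
    language_memory: list = field(default_factory=list)
    # Image memory, reduced here to named floor-plan coordinates (x, y).
    floor_plan: dict = field(default_factory=dict)

    def observe(self, description, name, xy):
        """Record one observation in both modalities."""
        self.language_memory.append(description)
        self.floor_plan[name] = xy

    def as_prompt_context(self):
        """Serialize the memory as text a language model could condition on."""
        places = ", ".join(
            f"{name} at {xy}" for name, xy in sorted(self.floor_plan.items())
        )
        scene = " ".join(self.language_memory)
        return f"Scene: {scene}\nKnown locations: {places}"


def ground_plan(plan, memory, capabilities):
    """Keep only steps whose action is supported and whose target is in memory."""
    return [
        (action, target)
        for action, target in plan
        if action in capabilities and target in memory.floor_plan
    ]


memory = EnvironmentMemory()
memory.observe("A coffee machine stands on the bar.", "bar", (1.0, 2.0))
memory.observe("Table 3 is by the window.", "table_3", (4.0, 0.5))

# Assumed candidate plan; in the paper a large model would propose the steps.
candidate = [
    ("move_to", "bar"),
    ("fly_to", "table_3"),    # unsupported action
    ("move_to", "kitchen"),   # target not present in memory
    ("move_to", "table_3"),
]
grounded = ground_plan(candidate, memory, {"move_to", "pick_up"})
print(memory.as_prompt_context())
print(grounded)
```

The filter only illustrates why a plan needs both a memory of the scene and a list of robot capabilities to be executable; how MEIA actually builds and queries its memory is described in the paper itself.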

Paper Structure

This paper contains 23 sections, 7 equations, 6 figures, 5 tables, and 1 algorithm.

Figures (6)

  • Figure 1: We propose MEIA, a model that decomposes high-level language instructions into a series of executable actions.
  • Figure 2: The structure diagram of MEIA. MEIA is implemented through three functional modules: the vision module, the control module, and the large model. The multimodal environmental memory generated by the vision module serves as a bridge between the control module and the large model, enabling them to work collaboratively to complete tasks and achieving efficient integration of large model perception, memory, and embodied control.
  • Figure 3: Construction process of an environmental floor plan.
  • Figure 4: Task execution. We present three tasks, each of which has a brief description of the implementation process.
  • Figure 5: Results of the instruction planning evaluation; -s denotes short instructions and -l denotes long instructions.
  • ...and 1 more figure