MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments
Yang Liu, Xinshuai Song, Kaixuan Jiang, Weixing Chen, Jingzhou Luo, Guanbin Li, Liang Lin
TL;DR
MEIA addresses the challenge of grounding embodied AI planning by introducing a Multimodal Environment Memory (MEM) that stores both language-based scene descriptions and visual representations, including a floor plan. By coupling MEM with vision-language models and large language models, MEIA translates natural-language goals into executable action sequences and carries them out in a virtual cafe simulator; an accompanying embodied QA dataset evaluates reasoning in dynamic environments. Key contributions are the MEM design (environmental language memory and image memory), the embodied QA dataset, and a zero-shot evaluation across sub-tasks, planning, and the full pipeline, which shows promising performance and the critical role of memory in producing executable plans. The work advances practical embodied perception and interaction toward more reliable, grounded robot–human collaboration in unknown environments.
Abstract
With the surge in the development of large language models, embodied intelligence has attracted increasing attention. Nevertheless, prior works on embodied intelligence typically encode scene or historical memory in a unimodal manner, either visual or linguistic, which complicates the alignment of the model's action planning with embodied control. To overcome this limitation, we introduce the Multimodal Embodied Interactive Agent (MEIA), which translates high-level tasks expressed in natural language into a sequence of executable actions. Specifically, we propose a novel Multimodal Environment Memory (MEM) module, which integrates embodied control with large models through a visual-language memory of scenes. This memory enables MEIA to generate executable action plans that account for diverse requirements and the robot's capabilities. Furthermore, we construct an embodied question answering dataset based on a dynamic virtual cafe environment with the help of a large language model. In this virtual environment, we conduct several experiments that use multiple large models in a zero-shot setting, with scenarios carefully designed to cover various situations. The experimental results demonstrate the promising performance of MEIA in various embodied interactive tasks.
