Improving MLLMs in Embodied Exploration and Question Answering with Human-Inspired Memory Modeling

Ji Li, Jing Xia, Mingyi Li, Shiyan Hu

TL;DR

The paper addresses the challenge of using Multimodal Large Language Models to drive embodied agents over long horizons with limited context. It introduces a non-parametric memory framework that cleanly separates episodic memory (for recalling past experiences) from semantic memory (for reusable rules), employing a retrieval-first, reasoning-assisted pipeline with visual verification and a program-style rule extraction mechanism. Semantic memory is grounded in test-time, decoupled semantic-physical spaces, enabling cross-environment generalization and robust decision making. Empirical results on A-EQA/OpenEQA and GOAT-Bench demonstrate state-of-the-art performance, reflecting improvements in exploration efficiency and complex reasoning, and highlighting complementary benefits of episodic and semantic memory for embodied tasks.

Abstract

Deploying Multimodal Large Language Models as the brain of embodied agents remains challenging, particularly under long-horizon observations and limited context budgets. Existing memory-assisted methods often rely on textual summaries, which discard rich visual and spatial details and remain brittle in non-stationary environments. In this work, we propose a non-parametric memory framework that explicitly disentangles episodic and semantic memory for embodied exploration and question answering. Our retrieval-first, reasoning-assisted paradigm recalls episodic experiences via semantic similarity and verifies them through visual reasoning, enabling robust reuse of past observations without rigid geometric alignment. In parallel, we introduce a program-style rule extraction mechanism that converts experiences into structured, reusable semantic memory, facilitating cross-environment generalization. Extensive experiments demonstrate state-of-the-art performance on embodied question answering and exploration benchmarks, yielding a 7.3% gain in LLM-Match and an 11.4% gain in LLM-Match × SPL on A-EQA, as well as +7.7% success rate and +6.8% SPL on GOAT-Bench. Analyses reveal that our episodic memory primarily improves exploration efficiency, while semantic memory strengthens the complex reasoning of embodied agents.
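The retrieval-first, reasoning-assisted recall described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Episode` record, the `recall` function, the `top_k` and `min_sim` parameters, and the `verify` callback (a stand-in for the MLLM's visual-reasoning check) are all hypothetical names introduced here for illustration. The sketch assumes that episodes are keyed by embedding vectors and that a candidate is reused only if it passes verification.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, List, Optional, Sequence


@dataclass
class Episode:
    """One stored experience (hypothetical layout, not from the paper)."""
    embedding: Sequence[float]  # semantic key used for retrieval
    snapshot_id: str            # handle to the raw visual observation


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall(
    query: Sequence[float],
    memory: List[Episode],
    verify: Callable[[Episode], bool],
    top_k: int = 3,
    min_sim: float = 0.5,
) -> Optional[Episode]:
    """Retrieve by similarity first, then keep the first verified candidate."""
    # Step 1 (retrieval): rank stored episodes by similarity to the query.
    ranked = sorted(memory, key=lambda e: cosine(query, e.embedding), reverse=True)
    # Step 2 (verification): check the best candidates one by one.
    for episode in ranked[:top_k]:
        if cosine(query, episode.embedding) < min_sim:
            break  # remaining candidates are even less similar
        if verify(episode):  # assumed stand-in for MLLM visual reasoning
            return episode
    return None  # nothing verified: the agent would fall back to exploring


memory = [
    Episode([1.0, 0.0], "kitchen"),
    Episode([0.9, 0.1], "hallway"),
    Episode([0.0, 1.0], "bedroom"),
]
# The most similar episode fails verification, so the runner-up is returned.
hit = recall([1.0, 0.0], memory, verify=lambda e: e.snapshot_id != "kitchen")
```

Because verification acts on the retrieved observation rather than on stored coordinates, a stale or mismatched memory is rejected instead of being trusted, which is one way to read the abstract's "without rigid geometric alignment".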

Paper Structure

This paper contains 12 sections, 5 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of the proposed pipeline. The agent integrates cognitive meta memory, episodic memory, and semantic memory to guide efficient exploration and accurate reasoning across sequential navigation subtasks.
  • Figure 2: Qualitative case study on A-EQA. Compared with the baseline 3D-Mem, our method demonstrates accurate cross-episode recall and more efficient exploration.
  • Figure 3: Qualitative examples of how program-style enhanced rules for semantic memory affect the reasoning trajectory when answering questions. Failed reasoning trajectories are highlighted in red and successful ones in green.