STMA: A Spatio-Temporal Memory Agent for Long-Horizon Embodied Task Planning

Mingcong Lei, Yiming Zhao, Ge Wang, Zhixin Mai, Shuguang Cui, Yatong Han, Jinke Ren

TL;DR

The paper addresses long-horizon embodied task planning in dynamic environments by introducing STMA, a framework that integrates a spatio-temporal memory module with a dynamic knowledge graph and a planner-critic loop. The temporal memory compresses interaction histories into a temporal belief, while the spatial memory constructs a dynamic knowledge graph (KG) to support spatial reasoning. The planner-critic module performs multi-step planning with real-time validation to reduce hallucinations. Empirical evaluation on TextWorld cooking tasks shows that STMA outperforms state-of-the-art baselines across difficulty levels and that open-source models can achieve competitive performance, which underscores the value of memory-grounded planning. The work demonstrates the importance of explicit spatio-temporal memory for robust long-horizon decision-making in partially observable environments and points to future enhancements in memory adaptability and scalability.

Abstract

A key objective of embodied intelligence is enabling agents to perform long-horizon tasks in dynamic environments while maintaining robust decision-making and adaptability. To achieve this goal, we propose the Spatio-Temporal Memory Agent (STMA), a novel framework designed to enhance task planning and execution by integrating spatio-temporal memory. STMA is built upon three critical components: (1) a spatio-temporal memory module that captures historical and environmental changes in real time, (2) a dynamic knowledge graph that facilitates adaptive spatial reasoning, and (3) a planner-critic mechanism that iteratively refines task strategies. We evaluate STMA in the TextWorld environment on 32 tasks, involving multi-step planning and exploration under varying levels of complexity. Experimental results demonstrate that STMA achieves a 31.25% improvement in success rate and a 24.7% increase in average score compared to the state-of-the-art model. The results highlight the effectiveness of spatio-temporal memory in advancing the memory capabilities of embodied agents.

Paper Structure

This paper contains 30 sections, 9 equations, 6 figures, 3 tables, 3 algorithms.

Figures (6)

  • Figure 1: Comparative overview of ReAct and STMA. (a) ReAct uses a simple history buffer to store action-feedback pairs and reasoning information, generating actions one step at a time. This approach lacks structured spatio-temporal reasoning, limiting its adaptability in complex, long-horizon tasks. (b) STMA utilizes dedicated spatial memory and temporal memory, summarized into refined spatial belief and temporal belief using the large model's capabilities. The planner-critic module enables closed-loop planning, dynamically validating and adjusting action sequences based on environmental feedback.
  • Figure 2: Overview of STMA. STMA consists of two components: a spatio-temporal memory module and a planner-critic module. The spatio-temporal memory module is divided into a temporal memory submodule and a spatial memory submodule, which provide temporal and spatial beliefs, respectively. These beliefs serve as the spatio-temporal context for the planner-critic module. The planner-critic module consists of a planner and a critic. The planner performs action planning based on the belief and generates multi-step plans in a single pass. The critic evaluates the plan before each action step, verifying whether the plan is correct and aligns with the most current environmental conditions.
  • Figure 3: Interaction with the TextWorld Environment. The interaction pattern between TextWorld and our framework involves the environment providing the agent with the current observation, inventory, and a list of possible actions. Based on the agent's executed actions, the environment returns feedback. These pieces of information are recorded in STMA's spatio-temporal memory, serving as the necessary context for the planner-critic agent. Within this framework, the planner and critic collaborate to generate action plans and interact with the environment.
  • Figure 4: Average score vs. steps of different frameworks (powered by GPT-4o)
  • Figure 5: STMA versus Reflexion in Case 1.
  • ...and 1 more figures
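
The closed-loop control flow described in the Figure 2 and Figure 3 captions (the planner emits a multi-step plan in one pass, and the critic checks each step against the current environment before it is executed) can be illustrated with a minimal sketch. This is not the authors' implementation: `ToyEnv`, `stub_planner`, `membership_critic`, and `run_episode` are hypothetical names, the planner and critic are hard-coded stand-ins for LLM calls, and the plain history list stands in for the spatio-temporal memory and its belief summaries.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (action, feedback) pairs; stand-in for the memory


@dataclass
class ToyEnv:
    """Hypothetical text environment: a door must be opened before going east."""
    door_open: bool = False
    done: bool = False

    def admissible(self) -> List[str]:
        # List of currently possible actions, as the environment would report it.
        return ["go east"] if self.door_open else ["open door"]

    def step(self, action: str) -> str:
        if action == "open door" and not self.door_open:
            self.door_open = True
            return "You open the door."
        if action == "go east" and self.door_open:
            self.done = True
            return "You walk east."
        return "Nothing happens."


def stub_planner(history: History, admissible: List[str]) -> List[str]:
    """Stand-in for an LLM planner that returns a multi-step plan in one pass."""
    if not history:
        return ["go east"]  # naive first plan that ignores the closed door
    if "open door" in admissible:
        return ["open door", "go east"]
    return ["go east"]


def membership_critic(action: str, admissible: List[str]) -> bool:
    """Stand-in for an LLM critic: accept a step only if it is currently possible."""
    return action in admissible


def run_episode(
    env: ToyEnv,
    planner: Callable[[History, List[str]], List[str]],
    critic: Callable[[str, List[str]], bool],
    max_steps: int = 10,
) -> Tuple[History, int]:
    history: History = []
    plan: List[str] = []
    replans = 0
    for _ in range(max_steps):
        if env.done:
            break
        if not plan:
            plan = planner(history, env.admissible())
        action = plan[0]
        # The critic validates the next step before it is executed.
        if not critic(action, env.admissible()):
            history.append((action, "rejected by critic"))
            plan = planner(history, env.admissible())  # replan with the feedback
            replans += 1
            continue
        plan.pop(0)
        history.append((action, env.step(action)))
    return history, replans


env = ToyEnv()
history, replans = run_episode(env, stub_planner, membership_critic)
print(replans, [action for action, _ in history])
```

In this toy run the critic rejects the first plan because its opening step is not among the currently possible actions, which triggers a single replan. In a real agent the rejection reason would be fed back to the planner as context rather than being a fixed string.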