STMA: A Spatio-Temporal Memory Agent for Long-Horizon Embodied Task Planning
Mingcong Lei, Yiming Zhao, Ge Wang, Zhixin Mai, Shuguang Cui, Yatong Han, Jinke Ren
TL;DR
The paper addresses long-horizon embodied task planning in dynamic environments by introducing STMA, a framework that integrates a spatio-temporal memory module with a dynamic knowledge graph (KG) and a planner-critic loop. The temporal memory compresses histories into a temporal belief, while the spatial memory constructs a dynamic KG to support spatial reasoning; the planner-critic module performs multi-step planning with real-time validation to reduce hallucinations. Empirical evaluation on TextWorld cooking tasks shows that STMA outperforms state-of-the-art baselines across difficulty levels and that open-source models can achieve competitive performance, highlighting the value of memory-grounded planning. The work demonstrates the importance of explicit spatio-temporal memory for robust long-horizon decision-making in partially observable environments and points to future enhancements in memory adaptability and scalability.
Abstract
A key objective of embodied intelligence is enabling agents to perform long-horizon tasks in dynamic environments while maintaining robust decision-making and adaptability. To achieve this goal, we propose the Spatio-Temporal Memory Agent (STMA), a novel framework designed to enhance task planning and execution by integrating spatio-temporal memory. STMA is built upon three critical components: (1) a spatio-temporal memory module that captures historical and environmental changes in real time, (2) a dynamic knowledge graph that facilitates adaptive spatial reasoning, and (3) a planner-critic mechanism that iteratively refines task strategies. We evaluate STMA in the TextWorld environment on 32 tasks, involving multi-step planning and exploration under varying levels of complexity. Experimental results demonstrate that STMA achieves a 31.25% improvement in success rate and a 24.7% increase in average score compared to the state-of-the-art model. The results highlight the effectiveness of spatio-temporal memory in advancing the memory capabilities of embodied agents.
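To make the three components concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the dynamic knowledge graph can be approximated by a dictionary of (subject, relation) pairs, that the temporal belief can be approximated by a window over recent steps (STMA's actual compression method is not described in this section), and that the planner and critic are plain callables. All class and function names (`SpatialMemory`, `TemporalMemory`, `plan_with_critic`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SpatialMemory:
    """Toy dynamic knowledge graph: (subject, relation) -> object."""
    edges: dict = field(default_factory=dict)

    def update(self, subject, relation, obj):
        # A newer observation overwrites a stale edge with the same key,
        # which is what makes the graph "dynamic" in this sketch.
        self.edges[(subject, relation)] = obj

    def neighbors(self, subject):
        # All (relation, object) pairs currently known for one entity.
        return {(r, o) for (s, r), o in self.edges.items() if s == subject}


@dataclass
class TemporalMemory:
    """Raw history plus a compressed view standing in for the temporal belief."""
    history: list = field(default_factory=list)
    window: int = 3  # assumed window size, chosen only for illustration

    def record(self, action, observation):
        self.history.append((action, observation))

    def belief(self):
        # Placeholder compression: keep only the most recent steps.
        return self.history[-self.window:]


def plan_with_critic(propose, critique, max_rounds=3):
    """Request a plan, let the critic validate it, and retry with its feedback."""
    feedback = None
    for _ in range(max_rounds):
        plan = propose(feedback)
        ok, feedback = critique(plan)
        if ok:
            return plan
    return None  # no plan passed validation within the round budget
```

In this sketch, `propose` would wrap a language-model call conditioned on `TemporalMemory.belief()` and on facts retrieved from `SpatialMemory`, and `critique` would check each proposed plan against the same memory before any action is executed.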
