Task Memory Engine: Spatial Memory for Robust Multi-Step LLM Agents
Ye Ye
TL;DR
The paper tackles the fragility of LLM-based agents in multi-step tasks by replacing linear context with a graph-based Task Memory Engine (TME) that supports revision-aware reasoning without fine-tuning. A central component, the Task Memory Structure (TMS), is implemented as a DAG and tracks subtasks, dependencies, and revisions, while the Task Representation and Intent Management (TRIM) component models user intent to generate compact, consistent subgraphs for LLM prompts. Across four case studies (trip planning, cooking, meeting scheduling, cart editing), TME-DAG reduces hallucinations and misinterpretations by up to 100% and achieves full task consistency, outperforming ReAct in complex, revision-heavy scenarios. The approach also demonstrates token efficiency through selective memory retrieval and promises plug-and-play deployment with open-source code and benchmarks, enabling broader adoption for reliable LLM agents. The work lays groundwork for future enhancements with Graph Neural Networks and loop-aware reasoning to handle more intricate, enterprise-scale workflows.
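To make the idea of selective memory retrieval concrete, the following is a minimal Python sketch of how a compact subgraph might be pulled from a task DAG and rendered into a prompt. It assumes that only the node being edited and its ancestors are relevant; the actual selection rule used by TRIM may differ. The function names `relevant_subgraph` and `render_prompt`, the dictionary layout, and the cart contents are all hypothetical and chosen only for illustration.

```python
def relevant_subgraph(deps: dict[str, list[str]], focus: str) -> list[str]:
    """Return `focus` and all of its ancestors, parents before children.

    `deps` maps a node id to the ids it depends on (assumed schema).
    """
    ordered: list[str] = []
    seen: set[str] = set()

    def visit(node: str) -> None:
        if node in seen:
            return
        seen.add(node)
        for parent in deps.get(node, []):
            visit(parent)
        ordered.append(node)  # post-order: parents are appended first

    visit(focus)
    return ordered


def render_prompt(values: dict[str, str], deps: dict[str, list[str]], focus: str) -> str:
    """Serialize only the relevant part of the graph for the LLM prompt."""
    lines = [f"- {node}: {values[node]}" for node in relevant_subgraph(deps, focus)]
    return "Relevant task state:\n" + "\n".join(lines)


# Hypothetical cart-editing state: the coupon depends only on the bread item.
values = {
    "cart": "shopping cart",
    "item_milk": "2 x milk",
    "item_bread": "1 x bread",
    "coupon": "10% off bread",
}
deps = {"item_milk": ["cart"], "item_bread": ["cart"], "coupon": ["item_bread"]}

selected = relevant_subgraph(deps, "coupon")
prompt = render_prompt(values, deps, "coupon")
print(prompt)
```

In this sketch the milk item never reaches the prompt, which is the sense in which retrieving a subgraph instead of the full history can save tokens.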
Abstract
Large Language Models (LLMs) falter in multi-step interactions -- often hallucinating, repeating actions, or misinterpreting user corrections -- due to reliance on linear, unstructured context. This fragility stems from the lack of persistent memory to track evolving goals and task dependencies, undermining trust in autonomous agents. We introduce the Task Memory Engine (TME), a modular memory controller that transforms existing LLMs into robust, revision-aware agents without fine-tuning. TME implements a spatial memory framework that replaces flat context with graph-based structures to support consistent, multi-turn reasoning. Departing from linear concatenation and ReAct-style prompting, TME builds a dynamic task graph -- either a tree or directed acyclic graph (DAG) -- to map user inputs to subtasks, align them with prior context, and enable dependency-tracked revisions. Its Task Representation and Intent Management (TRIM) component models task semantics and user intent to ensure accurate interpretation. Across four multi-turn scenarios -- trip planning, cooking, meeting scheduling, and shopping cart editing -- TME eliminates 100% of hallucinations and misinterpretations in three tasks, and reduces hallucinations by 66.7% and misinterpretations by 83.3% across 27 user turns, outperforming ReAct. TME's modular design supports plug-and-play deployment and domain-specific customization, adaptable to both personal assistants and enterprise automation. We release TME's codebase, benchmarks, and components as open-source resources, enabling researchers to develop reliable LLM agents. TME's scalable architecture addresses a critical gap in agent performance across complex, interactive settings.
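As a rough illustration of what a dependency-tracked revision could look like, here is a minimal Python sketch of a task DAG in which revising one subtask flags everything downstream for re-checking. The names `TaskNode`, `TaskGraph`, `revise`, and the `stale` flag are assumptions made for this example and are not taken from the released TME codebase; the trip-planning values are likewise invented.

```python
from dataclasses import dataclass, field


@dataclass
class TaskNode:
    """One subtask in the graph (hypothetical schema)."""
    node_id: str
    value: str
    depends_on: list[str] = field(default_factory=list)
    history: list[str] = field(default_factory=list)  # earlier values, oldest first
    stale: bool = False  # True when an upstream revision invalidated this node


class TaskGraph:
    def __init__(self) -> None:
        self.nodes: dict[str, TaskNode] = {}

    def add(self, node_id: str, value: str, depends_on: tuple[str, ...] = ()) -> None:
        # Parents must already exist and ids are unique, so no cycle can form
        # and insertion order is always a valid topological order.
        if node_id in self.nodes:
            raise ValueError(f"duplicate node: {node_id}")
        for parent in depends_on:
            if parent not in self.nodes:
                raise KeyError(f"unknown dependency: {parent}")
        self.nodes[node_id] = TaskNode(node_id, value, list(depends_on))

    def dependents(self, node_id: str) -> list[str]:
        """All nodes downstream of `node_id`, in insertion order."""
        affected = {node_id}
        result: list[str] = []
        for node in self.nodes.values():
            if any(parent in affected for parent in node.depends_on):
                affected.add(node.node_id)
                result.append(node.node_id)
        return result

    def revise(self, node_id: str, new_value: str) -> list[str]:
        """Update a node in place, keep its old value, and flag dependents."""
        node = self.nodes[node_id]
        node.history.append(node.value)
        node.value = new_value
        node.stale = False
        downstream = self.dependents(node_id)
        for dependent_id in downstream:
            self.nodes[dependent_id].stale = True
        return downstream


# Hypothetical trip-planning session: the user changes the destination.
graph = TaskGraph()
graph.add("destination", "Paris")
graph.add("hotel", "Hotel in Paris", depends_on=("destination",))
graph.add("dinner", "Restaurant near hotel", depends_on=("hotel",))
graph.add("budget", "2000 USD")

stale = graph.revise("destination", "Rome")
print(stale)
```

The design choice worth noting is that a correction updates the existing node instead of appending a new turn to a flat transcript, so the old value survives in `history` and unrelated subtasks such as `budget` are left untouched.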
