Task Memory Engine: Spatial Memory for Robust Multi-Step LLM Agents

Ye Ye

TL;DR

The paper tackles the fragility of LLM-based agents in multi-step tasks by replacing linear context with a graph-based Task Memory Engine (TME) that supports revision-aware reasoning without fine-tuning. A central component, the Task Memory Structure (TMS) implemented as a DAG, tracks subtasks, dependencies, and revisions, while TRIM models user intent to generate compact, consistent subgraphs for LLM prompts. Across four case studies (trip planning, cooking, meeting scheduling, cart editing), TME-DAG substantially reduces hallucinations and misinterpretations, achieving up to 100% reductions and full task consistency, outperforming ReAct in complex, revision-heavy scenarios. The approach also demonstrates token efficiency through selective memory retrieval and promises plug-and-play deployment with open-source code and benchmarks, enabling broader adoption for reliable LLM agents. The work lays groundwork for future enhancements with Graph Neural Networks and loop-aware reasoning to handle more intricate, enterprise-scale workflows.
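To make the idea of dependency-tracked revisions concrete, here is a minimal sketch of a task DAG in Python. It is not the paper's implementation: the names `TaskNode`, `TaskGraph`, `revise`, `subgraph`, and the `stale` flag are assumptions introduced for illustration only. It shows two things: how revising one subtask can flag the subtasks that depend on it, and how a compact subgraph (a node plus its ancestors) can be selected as prompt context.

```python
from dataclasses import dataclass, field


@dataclass
class TaskNode:
    """One subtask in the graph (hypothetical fields, not the paper's schema)."""
    node_id: str
    description: str
    depends_on: list[str] = field(default_factory=list)
    revision: int = 0    # incremented each time the user revises this subtask
    stale: bool = False  # set when an upstream subtask has been revised


class TaskGraph:
    """Minimal DAG of subtasks with dependency-tracked revisions."""

    def __init__(self) -> None:
        self.nodes: dict[str, TaskNode] = {}

    def add(self, node_id: str, description: str,
            depends_on: tuple[str, ...] = ()) -> None:
        # A new node may only point at nodes that already exist, so the
        # graph stays acyclic and insertion order is a topological order.
        if node_id in self.nodes:
            raise ValueError(f"duplicate node: {node_id}")
        missing = [d for d in depends_on if d not in self.nodes]
        if missing:
            raise KeyError(f"unknown dependencies: {missing}")
        self.nodes[node_id] = TaskNode(node_id, description, list(depends_on))

    def dependents(self, node_id: str) -> list[str]:
        """Every node that transitively depends on node_id."""
        affected = {node_id}
        result: list[str] = []
        for nid, node in self.nodes.items():  # topological (insertion) order
            if nid != node_id and any(d in affected for d in node.depends_on):
                affected.add(nid)
                result.append(nid)
        return result

    def revise(self, node_id: str, new_description: str) -> list[str]:
        """Apply a user correction and flag downstream subtasks as stale."""
        node = self.nodes[node_id]
        node.description = new_description
        node.revision += 1
        node.stale = False
        downstream = self.dependents(node_id)
        for nid in downstream:
            self.nodes[nid].stale = True
        return downstream

    def subgraph(self, node_id: str) -> list[str]:
        """The node plus its transitive ancestors: a compact prompt context."""
        keep: set[str] = set()
        stack = [node_id]
        while stack:
            current = stack.pop()
            if current not in keep:
                keep.add(current)
                stack.extend(self.nodes[current].depends_on)
        return [nid for nid in self.nodes if nid in keep]


# Toy cooking example (invented data, not taken from the paper's benchmark).
graph = TaskGraph()
graph.add("sauce", "make tomato sauce")
graph.add("pasta", "boil pasta")
graph.add("dish", "combine pasta and sauce", depends_on=("sauce", "pasta"))
graph.add("dessert", "bake a cake")

print(graph.revise("sauce", "make pesto instead"))  # ['dish']
print(graph.subgraph("dish"))                       # ['sauce', 'pasta', 'dish']
```

In this sketch the unrelated `dessert` node is neither flagged by the revision nor included in the retrieved context, which is the kind of selective retrieval the summary credits for token efficiency.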

Abstract

Large Language Models (LLMs) falter in multi-step interactions -- often hallucinating, repeating actions, or misinterpreting user corrections -- due to reliance on linear, unstructured context. This fragility stems from the lack of persistent memory to track evolving goals and task dependencies, undermining trust in autonomous agents. We introduce the Task Memory Engine (TME), a modular memory controller that transforms existing LLMs into robust, revision-aware agents without fine-tuning. TME implements a spatial memory framework that replaces flat context with graph-based structures to support consistent, multi-turn reasoning. Departing from linear concatenation and ReAct-style prompting, TME builds a dynamic task graph -- either a tree or directed acyclic graph (DAG) -- to map user inputs to subtasks, align them with prior context, and enable dependency-tracked revisions. Its Task Representation and Intent Management (TRIM) component models task semantics and user intent to ensure accurate interpretation. Across four multi-turn scenarios -- trip planning, cooking, meeting scheduling, and shopping cart editing -- TME eliminates 100% of hallucinations and misinterpretations in three tasks, and reduces hallucinations by 66.7% and misinterpretations by 83.3% across 27 user turns, outperforming ReAct. TME's modular design supports plug-and-play deployment and domain-specific customization, adaptable to both personal assistants and enterprise automation. We release TME's codebase, benchmarks, and components as open-source resources, enabling researchers to develop reliable LLM agents. TME's scalable architecture addresses a critical gap in agent performance across complex, interactive settings.

Paper Structure

This paper contains 34 sections, 2 equations, 6 figures, 11 tables, 1 algorithm.

Figures (6)

  • Figure 1: TME Architecture: TRIM orchestrates the flow through five steps: (1) decomposing inputs, (2) updating the TMS-DAG Forest, (3) retrieving subgraphs, (4) passing context to the LLM, and (5) generating responses.
  • Figure 2: TME execution pipeline (left) with example trace from the cooking scenario (right). TRIM handles input decomposition and intent classification (Steps 1–2); TMS-DAG Update occurs in Step 3; Retrieval + Response are Steps 4–5.
  • Figure 3: ReAct vs. TME-DAG: Hallucination and Confusion in Trip Planning Scenario (Rounds 8–10). This diagram illustrates how ReAct misinterprets a user query in Round 8 and a flight search in Round 9 as updates, resulting in hallucination in Round 10. In contrast, TME-DAG correctly treats the query as a check, logs the new flight as an independent node, and produces a consistent summary via memory tracking and slot-based dependency reasoning.
  • Figure 4: Comparison of ReAct and TME-DAG in a cooking scenario with ingredient substitution. Left: ReAct exhibits memory inconsistencies. Right: TME-DAG ensures cross-task consistency via graph-based updates.
  • Figure 5: Token usage trend for Baseline-flat and TME across the six rounds of the form-filling task.
  • ...and 1 more figure
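
The five steps named in the Figure 1 caption can be sketched as a single turn-handling function. This is a toy illustration under stated assumptions, not the released TME code: the `set <slot> <value>` / `check <slot>` input format, the flat `Memory` dictionary standing in for the TMS-DAG forest, and the function names `decompose`, `update_memory`, `retrieve`, `run_turn`, and `echo_llm` are all hypothetical. The point it illustrates is the one made in the Figure 3 caption: a query classified as a check should read memory without modifying it.

```python
from typing import Callable

# Stand-in for the TMS-DAG forest: slot name -> current value (assumption).
Memory = dict[str, str]


def decompose(user_input: str) -> tuple[str, str, str]:
    """Step 1: split the input into (intent, slot, value) with toy rules.

    Assumed input format: 'set <slot> <value>' or 'check <slot>'.
    """
    parts = user_input.strip().split(maxsplit=2)
    if len(parts) == 3 and parts[0] == "set":
        return "update", parts[1], parts[2]
    if len(parts) == 2 and parts[0] == "check":
        return "check", parts[1], ""
    raise ValueError(f"cannot parse: {user_input!r}")


def update_memory(memory: Memory, intent: str, slot: str, value: str) -> None:
    """Step 2: only 'update' intents modify memory; a 'check' never does."""
    if intent == "update":
        memory[slot] = value


def retrieve(memory: Memory, slot: str) -> Memory:
    """Step 3: return only the entry relevant to this turn."""
    return {slot: memory[slot]} if slot in memory else {}


def run_turn(memory: Memory, user_input: str,
             llm: Callable[[str], str]) -> str:
    intent, slot, value = decompose(user_input)
    update_memory(memory, intent, slot, value)
    context = retrieve(memory, slot)
    # Step 4: pass only the retrieved context to the model, not the history.
    prompt = f"Known state: {context}\nUser ({intent}): {user_input}"
    # Step 5: generate the response.
    return llm(prompt)


def echo_llm(prompt: str) -> str:
    """Placeholder model that returns its prompt; swap in a real LLM call."""
    return prompt


memory: Memory = {}
run_turn(memory, "set flight UA100", echo_llm)
run_turn(memory, "set hotel Hilton", echo_llm)
print(run_turn(memory, "check flight", echo_llm))
```

Because `retrieve` returns a single slot, the prompt for the final turn mentions the flight but not the hotel, and the `check` turn leaves `memory` unchanged.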