D-MEM: Dopamine-Gated Agentic Memory via Reward Prediction Error Routing

Yuru Song; Qi Xin

Abstract

Autonomous LLM agents require structured long-term memory, yet current "append-and-evolve" systems like A-MEM incur O(N^2) write latency and excessive token costs. We introduce D-MEM (Dopamine-Gated Agentic Memory), a biologically inspired architecture that decouples short-term interaction from cognitive restructuring via a Fast/Slow routing system based on Reward Prediction Error (RPE). A lightweight Critic Router scores each incoming stimulus for Surprise and Utility. Routine, low-RPE inputs are bypassed or cached in an O(1) fast-access buffer. Conversely, high-RPE inputs, such as factual contradictions or preference shifts, trigger a "dopamine" signal that activates the O(N) memory evolution pipeline to reshape the agent's knowledge graph. To evaluate performance under realistic conditions, we introduce the LoCoMo-Noise benchmark, which injects controlled conversational noise into long-term sessions. Evaluations demonstrate that D-MEM reduces token consumption by over 80%, eliminates O(N^2) bottlenecks, and outperforms baselines in multi-hop reasoning and adversarial resilience. By selectively gating cognitive restructuring, D-MEM provides a scalable, cost-efficient foundation for lifelong agentic memory.
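
The Fast/Slow routing described in the abstract can be illustrated with a minimal sketch. The tier names and the thresholds 0.3 and 0.7 come from the Figure 2 caption, and the Utility early cutoff comes from the Figure 3 caption. Everything else is an assumption made for illustration: the function names `combine_rpe` and `route` are hypothetical, the rule that combines Surprise and Utility into an RPE score (a plain average here) is a placeholder rather than the paper's equation, and the choice of inclusive versus exclusive threshold comparisons is a guess.

```python
from enum import Enum


class Tier(Enum):
    SKIP = "SKIP"
    CONSTRUCT_ONLY = "CONSTRUCT_ONLY"
    FULL_EVOLUTION = "FULL_EVOLUTION"


# Thresholds as reported in the Figure 2 caption.
THETA_LOW = 0.3
THETA_HIGH = 0.7


def combine_rpe(surprise: float, utility: float) -> float:
    """Placeholder combination rule (ASSUMPTION): a plain average.

    The source page does not state how Surprise and Utility are combined.
    """
    return 0.5 * (surprise + utility)


def route(surprise: float, utility: float) -> Tier:
    """Assign a routing tier to one turn, given scores in [0, 1]."""
    # Early cutoff on Utility, as described in the Figure 3 caption.
    if utility < THETA_LOW:
        return Tier.SKIP
    rpe = combine_rpe(surprise, utility)
    if rpe >= THETA_HIGH:
        # High RPE: trigger the expensive memory evolution pipeline.
        return Tier.FULL_EVOLUTION
    if rpe >= THETA_LOW:
        # Mid RPE: store the turn without restructuring existing memory.
        return Tier.CONSTRUCT_ONLY
    return Tier.SKIP


if __name__ == "__main__":
    for s, u in [(0.9, 0.1), (0.5, 0.5), (0.9, 0.9)]:
        print(f"surprise={s}, utility={u} -> {route(s, u).value}")
```

Under these assumptions, a surprising but low-utility turn is skipped before any RPE is computed, which is one way to read the caption's remark that routing is governed by Utility rather than Surprise.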

Paper Structure (23 sections, 3 equations, 5 figures, 3 tables)

Introduction
The LoCoMo-Noise Benchmark
Motivation
Noise Injection Methodology
Related Work
Static Retrieval and Working Memory Constraints
Dynamic and Evolving Agentic Memory Systems
Bio-Inspired Routing and Fast/Slow Cognitive Gating
Methodology: The D-MEM Architecture
Agentic Reward Prediction Error (RPE)
The Critic Router and Hierarchical Routing
Zero-Cost Retrieval Augmentation
Experiments & Results
Long-term Memory Accuracy
Efficiency: Token Savings via Intelligent Routing
...and 8 more sections

Figures (5)

Figure 1: LoCoMo-Noise benchmark construction pipeline. An LLM-based noise generator synthesizes three categories of noise (Filler, 40%; Status, 30%; Tangent, 30%) and interleaves them with the original session at a target noise ratio of 75%. The core factual turns ("needles") are preserved at their original positions, while synthetic noise is injected at random intervals to simulate real-world conversational dynamics.
Figure 2: RPE component decomposition over 700 turns. The top panel shows the RPE signal (black) overlaid with its Surprise (orange) and Utility (purple) components, along with the routing thresholds $\theta_{low}=0.3$ and $\theta_{high}=0.7$. The bottom panel visualizes the resulting routing tier assignment per turn: SKIP (gray), CONSTRUCT_ONLY (blue), and FULL_EVOLUTION (red). The dominance of blue and the sparsity of red confirm that expensive memory evolution is reserved for genuinely paradigm-shifting inputs.
Figure 3: Routing analysis on the LoCoMo-Noise benchmark. (a) Scatter plot of all turns in the Surprise-Utility space, colored by the routing tier assigned by the Critic Router. All SKIP decisions are tightly concentrated in the "Early cutoff" region ($\text{Utility} < 0.3$, yellow dashed line), confirming that the routing is governed by Utility rather than Surprise. (b) Routing distribution stratified by input type (Noise vs. Real). Counter-intuitively, real turns are skipped at a higher rate (53.9%) than noise turns (43.2%), revealing a calibration asymmetry discussed in the text.
Figure 4: Attention heatmap: cosine similarity between query turns and memory slots. Each row corresponds to a dialogue turn (y-axis) and each column to a memory slot index (x-axis). The characteristic staircase-diagonal structure confirms that the active memory frontier advances monotonically as the session progresses. Vertical high-similarity bands (deep red) indicate memory slots that remain persistently salient across hundreds of subsequent turns, the physical substrate of multi-hop reasoning. The left-side tier strip encodes the routing decision for each turn: gray (SKIP), blue (CONSTRUCT_ONLY), red (FULL_EVOLUTION).
Figure 5: D-MEM Memory Manifold (UMAP). Each point represents a memory entry, colored by Turn Number (purple $\to$ yellow) and shaped by routing tier: CONSTRUCT (star), EVOLVE (circle), SKIP (cross). Blue diamonds mark LTM (final) nodes retained in the persistent knowledge graph; gray dots mark STM (final) buffer entries. LTM nodes occupy dense, well-separated topical clusters across the manifold, while STM entries form a compact, isolated region, indicating that the hierarchical routing produces a structurally stable latent space rather than representation collapse.
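
The noise injection summarized in the Figure 1 caption can be sketched as follows. The category mix (40/30/30) and the 75% target ratio are taken from that caption. The rest is assumed for illustration: the function names `make_noise_turn` and `interleave_noise` are hypothetical, the LLM-based generator is replaced by a fixed placeholder string, the noise ratio is read as the fraction of noise turns among all turns, and "preserved at their original positions" is approximated by keeping the needles in their original relative order.

```python
import random

# Category mix as reported in the Figure 1 caption.
NOISE_MIX = {"Filler": 0.4, "Status": 0.3, "Tangent": 0.3}


def make_noise_turn(category: str) -> str:
    """Hypothetical stand-in for the LLM-based noise generator."""
    return f"[{category} noise]"


def interleave_noise(needles, noise_ratio=0.75, seed=0):
    """Interleave synthetic noise turns with the original turns.

    ASSUMPTION: noise_ratio = noise turns / (noise turns + needles).
    Needles keep their original relative order.
    """
    if not 0.0 <= noise_ratio < 1.0:
        raise ValueError("noise_ratio must be in [0, 1)")
    rng = random.Random(seed)
    n_noise = round(len(needles) * noise_ratio / (1.0 - noise_ratio))
    # Sample a category for every noise turn according to the target mix.
    categories = rng.choices(
        list(NOISE_MIX), weights=list(NOISE_MIX.values()), k=n_noise
    )
    # gaps[i] holds noise placed before needle i; the last gap follows all needles.
    gaps = [[] for _ in range(len(needles) + 1)]
    for category in categories:
        gaps[rng.randrange(len(needles) + 1)].append(category)
    session = []
    for i, needle in enumerate(needles):
        session.extend(("noise", make_noise_turn(c)) for c in gaps[i])
        session.append(("needle", needle))
    session.extend(("noise", make_noise_turn(c)) for c in gaps[-1])
    return session


if __name__ == "__main__":
    for kind, text in interleave_noise(["fact A", "fact B"]):
        print(kind, text)
```

With a 75% ratio under this reading, each original turn is accompanied on average by three noise turns, so a session of four needles grows to sixteen turns.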
