ENGRAM: Effective, Lightweight Memory Orchestration for Conversational Agents

Daivik Patel, Shrenik Patel

TL;DR

This work tackles the challenge of maintaining long-horizon consistency in conversational LLMs without resorting to large, complex memory architectures. It introduces ENGRAM, a compact memory system that partitions memories into episodic, semantic, and procedural types, connected by a single router and dense retrieval, storing records in a local SQLite store. ENGRAM demonstrates state-of-the-art semantic correctness on LoCoMo and surpasses a full-context baseline on LongMemEval while using roughly 1% of the tokens, highlighting substantial gains in efficiency without sacrificing accuracy. The findings suggest that careful memory typing coupled with straightforward retrieval can enable scalable, reproducible long-term memory for chat agents, and the authors provide a reproducible implementation and evaluation harness to encourage adoption and further research.

Abstract

Large language models (LLMs) deployed in user-facing applications require long-horizon consistency: the ability to remember prior interactions, respect user preferences, and ground reasoning in past events. However, contemporary memory systems often adopt complex architectures such as knowledge graphs, multi-stage retrieval pipelines, and OS-style schedulers, which introduce engineering complexity and reproducibility challenges. We present ENGRAM, a lightweight memory system that organizes conversation into three canonical memory types (episodic, semantic, and procedural) through a single router and retriever. Each user turn is converted into typed memory records with normalized schemas and embeddings and stored in a database. At query time, the system retrieves top-k dense neighbors for each type, merges results with simple set operations, and provides the most relevant evidence as context to the model. ENGRAM attains state-of-the-art results on LoCoMo, a multi-session conversational QA benchmark for long-horizon memory, and exceeds the full-context baseline by 15 points on LongMemEval while using only about 1% of the tokens. These results show that careful memory typing and straightforward dense retrieval can enable effective long-term memory management in language models without requiring complex architectures.
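The query-time step described above (per-type top-k dense retrieval followed by a set-style merge) can be illustrated with a minimal sketch. This is not the authors' implementation: the `MemoryRecord` fields, the `retrieve` function, and the toy three-dimensional vectors are hypothetical, and a real system would obtain embeddings from an embedding model and read records from the database rather than from an in-memory list. The default `k=25` follows the value given in the Figure 2 caption.

```python
import math
from dataclasses import dataclass

MEMORY_TYPES = ("episodic", "semantic", "procedural")


@dataclass(frozen=True)
class MemoryRecord:
    # Hypothetical schema for illustration; the paper's actual schema may differ.
    record_id: str
    memory_type: str
    text: str
    embedding: tuple


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, records, k):
    """Return the k records most similar to the query."""
    ranked = sorted(records, key=lambda r: cosine(query_vec, r.embedding), reverse=True)
    return ranked[:k]


def retrieve(query_vec, store, k=25):
    """Take the top-k neighbors per memory type, then merge and deduplicate by id."""
    seen = set()
    merged = []
    for memory_type in MEMORY_TYPES:
        candidates = [r for r in store if r.memory_type == memory_type]
        for record in top_k(query_vec, candidates, k):
            if record.record_id not in seen:  # set-based deduplication
                seen.add(record.record_id)
                merged.append(record)
    # Order the merged evidence by similarity before building the prompt context.
    merged.sort(key=lambda r: cosine(query_vec, r.embedding), reverse=True)
    return merged


# Toy data: vectors are made up, not produced by any embedding model.
query = (1.0, 0.0, 0.0)
store = [
    MemoryRecord("e1", "episodic", "toy episodic note A", (1.0, 0.0, 0.0)),
    MemoryRecord("e2", "episodic", "toy episodic note B", (0.0, 1.0, 0.0)),
    MemoryRecord("s1", "semantic", "toy semantic fact", (0.6, 0.8, 0.0)),
    MemoryRecord("p1", "procedural", "toy procedural rule", (0.0, 0.0, 1.0)),
]
hits = retrieve(query, store, k=1)
print([r.record_id for r in hits])
```

With `k=1`, each type contributes its single nearest record, so the merged context holds at most three records no matter how long the conversation history is. This is the intuition behind the small token budget reported above.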

Paper Structure

This paper contains 32 sections, 8 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: System overview of ENGRAM. Turns are routed into typed stores, embedded, persisted, and later retrieved with semantic search before being passed as context to an answering model. The diagram highlights both the memory creation stage (routing and extraction) and the retrieval stage (top-$k$ selection, aggregation, and prompt injection). Numbers (1)--(5) mark the main components and are referenced below.
  • Figure 2: End-to-end ENGRAM QA walkthrough. The diagram illustrates how turns are first routed into typed stores (episodic/semantic/procedural), normalized, and embedded, and then, at query time, how the query is embedded and used to retrieve per-type top-$k$ neighbors by cosine similarity (default $K{=}25$), followed by aggregation and deduplication. Finally, the retrieved, timestamped records are serialized into a fixed prompt template and passed to the answering model (gpt-4o-mini). The figure also shows a concrete example (“What are the names of Audrey’s dogs?”), the evidence snippets with similarity scores, and ENGRAM's answer vs. the gold answer. Embedding and model components (text-embedding-3-small, gpt-4o-mini) are annotated to emphasize the separation between memory construction, retrieval, and answer generation.
  • Figure :