Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents

Natchanon Pollertlam, Witchayut Kornsuwannawit

TL;DR

This work compares a fact-based memory system built on the Mem0 framework against long-context LLM inference on three memory-centric benchmarks. It also constructs a cost model that incorporates prompt caching, showing that the two architectures have structurally different cost profiles.

Abstract

Persistent conversational AI systems face a choice between passing full conversation histories to a long-context large language model (LLM) and maintaining a dedicated memory system that extracts and retrieves structured facts. We compare a fact-based memory system built on the Mem0 framework against long-context LLM inference on three memory-centric benchmarks - LongMemEval, LoCoMo, and PersonaMemv2 - and evaluate both architectures on accuracy and cumulative API cost. Long-context GPT-5-mini achieves higher factual recall on LongMemEval and LoCoMo, while the memory system is competitive on PersonaMemv2, where persona consistency depends on stable, factual attributes suited to flat-typed extraction. We construct a cost model that incorporates prompt caching and show that the two architectures have structurally different cost profiles: long-context inference incurs a per-turn charge that grows with context length even under caching, while the memory system's per-turn read cost remains roughly fixed after a one-time write phase. At a context length of 100k tokens, the memory system becomes cheaper after approximately ten interaction turns, with the break-even point decreasing as context length grows. These results characterize the accuracy-cost trade-off between the two approaches and provide a concrete criterion for selecting between them in production deployments.
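
The cost structure described in the abstract can be illustrated with a small sketch. This is not the paper's cost model: it is a simplified stand-in that assumes the long-context approach pays the uncached input price on the first turn and the cached price on every later turn, that the memory system pays a one-time write cost proportional to context length plus a flat read cost per turn, and that query and output tokens are ignored. The names `Prices`, `long_context_cost`, `memory_cost`, and `break_even_turn`, and the toy unit prices in the demo, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Prices:
    """Hypothetical unit prices (arbitrary currency units, not real API rates)."""
    uncached_input: float  # per context token, first time the context is read
    cached_input: float    # per context token on a prompt-cache hit
    memory_write: float    # per context token, one-time fact extraction
    memory_read: float     # flat cost per turn for retrieval plus a short prompt


def long_context_cost(context_tokens: int, turns: int, p: Prices) -> float:
    """Cumulative cost of resending the full context on every turn.

    Assumes turn 1 is uncached and every later turn is a full cache hit.
    """
    if turns <= 0:
        return 0.0
    first_turn = context_tokens * p.uncached_input
    later_turns = (turns - 1) * context_tokens * p.cached_input
    return first_turn + later_turns


def memory_cost(context_tokens: int, turns: int, p: Prices) -> float:
    """Cumulative cost of a one-time write phase plus a fixed per-turn read."""
    if turns <= 0:
        return 0.0
    return context_tokens * p.memory_write + turns * p.memory_read


def break_even_turn(context_tokens: int, p: Prices,
                    max_turns: int = 10_000) -> Optional[int]:
    """Smallest turn count at which the memory system is strictly cheaper.

    Returns None if that never happens within max_turns.
    """
    for n in range(1, max_turns + 1):
        if memory_cost(context_tokens, n, p) < long_context_cost(context_tokens, n, p):
            return n
    return None


if __name__ == "__main__":
    # Toy prices chosen only to make the arithmetic easy to follow.
    toy = Prices(uncached_input=1.0, cached_input=0.25,
                 memory_write=3.0, memory_read=5.0)
    for length in (10, 100, 200, 1000):
        print(length, break_even_turn(length, toy))
```

Under these toy prices the break-even turn falls as the context grows, and for very short contexts the memory system never becomes cheaper, because the flat read cost exceeds the cached per-turn charge. The direction of the first effect matches the trend stated in the abstract; the specific turn counts depend entirely on the assumed prices.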

Paper Structure (35 sections, 3 figures, 7 tables)

This paper contains 35 sections, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Break-even heatmap: cumulative cost difference (LC minus Memory) as a function of context length $L$ and number of turns $N$. Red regions indicate where the long-context approach is cheaper; blue regions indicate where the memory system is cheaper. The black curve marks the break-even boundary.
  • Figure 2: Custom instructions passed to the Mem0 add stage for memory extraction.
  • Figure 3: System and user prompt templates used for LLM-as-a-judge evaluation.