ConvoMem Benchmark: Why Your First 150 Conversations Don't Need RAG

Egor Pakhomov, Erik Nijkamp, Caiming Xiong

TL;DR

This paper introduces ConvoMem, a large-scale conversational memory benchmark of 75,336 QA pairs designed to evaluate memory capabilities across six categories and multi-message evidence distributions. It analyzes the relationship between memory and RAG, showing that memory systems can leverage small, growing corpora to achieve high accuracy with naive long-context policies for the first ~150 conversations, while sophisticated RAG approaches lag behind in early stages but offer cost advantages at scale. The authors propose a progressive memory design philosophy, including block-based and single-pass hybrid extraction architectures, and demonstrate how mid-tier models can match or approach premium-model performance at a fraction of the cost. The work provides practical deployment guidance for enterprise dialogue systems, arguing for a dual-track strategy that uses long-context memory for recent conversations and cost-effective RAG for longer histories, supported by a fully reproducible data-generation and evaluation framework.

Abstract

We introduce a comprehensive benchmark for conversational memory evaluation containing 75,336 question-answer pairs across diverse categories including user facts, assistant recall, abstention, preferences, temporal changes, and implicit connections. While existing benchmarks have advanced the field, our work addresses fundamental challenges in statistical power, data generation consistency, and evaluation flexibility that limit current memory evaluation frameworks. We examine the relationship between conversational memory and retrieval-augmented generation (RAG). While these systems share fundamental architectural patterns--temporal reasoning, implicit extraction, knowledge updates, and graph representations--memory systems have a unique characteristic: they start from zero and grow progressively with each conversation. This characteristic enables naive approaches that would be impractical for traditional RAG. Consistent with recent findings on long context effectiveness, we observe that simple full-context approaches achieve 70-82% accuracy even on our most challenging multi-message evidence cases, while sophisticated RAG-based memory systems like Mem0 achieve only 30-45% when operating on conversation histories under 150 interactions. Our analysis reveals practical transition points: long context excels for the first 30 conversations, remains viable with manageable trade-offs up to 150 conversations, and typically requires hybrid or RAG approaches beyond that point as costs and latencies become prohibitive. These patterns indicate that the small-corpus advantage of conversational memory--where exhaustive search and complete reranking are feasible--deserves dedicated research attention rather than simply applying general RAG solutions to conversation histories.
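
The transition points described in the abstract can be read as a simple routing rule on history size. The following is a minimal sketch of that reading, not the authors' implementation: the names `MemoryStrategy` and `choose_memory_strategy` are hypothetical, and treating 30 and 150 as inclusive upper bounds is an assumption, since the abstract gives approximate ranges rather than exact cutoffs.

```python
from enum import Enum


class MemoryStrategy(Enum):
    """Hypothetical labels for the three regimes named in the abstract."""
    LONG_CONTEXT = "long_context"
    LONG_CONTEXT_WITH_TRADEOFFS = "long_context_with_tradeoffs"
    HYBRID_OR_RAG = "hybrid_or_rag"


# Thresholds taken from the abstract; inclusive boundaries are an assumption.
LONG_CONTEXT_EXCELS_UP_TO = 30
LONG_CONTEXT_VIABLE_UP_TO = 150


def choose_memory_strategy(num_conversations: int) -> MemoryStrategy:
    """Pick a memory regime from the number of stored conversations."""
    if num_conversations < 0:
        raise ValueError("num_conversations must be non-negative")
    if num_conversations <= LONG_CONTEXT_EXCELS_UP_TO:
        # Small history: placing everything in context is reported to work best.
        return MemoryStrategy.LONG_CONTEXT
    if num_conversations <= LONG_CONTEXT_VIABLE_UP_TO:
        # Still viable, but cost and latency start to matter.
        return MemoryStrategy.LONG_CONTEXT_WITH_TRADEOFFS
    # Beyond this point the abstract suggests hybrid or RAG approaches.
    return MemoryStrategy.HYBRID_OR_RAG


if __name__ == "__main__":
    for n in (10, 90, 400):
        print(n, choose_memory_strategy(n).value)
```

In a real deployment the cutoffs would likely depend on conversation length, model pricing, and latency budgets, so they are best treated as tunable parameters rather than fixed constants.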

Paper Structure

This paper contains 76 sections, 20 figures, and 2 tables.

Figures (20)

  • Figure 1: Long Context Memory Accuracy
  • Figure 2: Long Context Cost Scaling
  • Figure 3: Long Context Latency
  • Figure 4: User Facts Evidence Scaling
  • Figure 5: Changing Facts Evidence Scaling
  • ...and 15 more figures