
According to Me: Long-Term Personalized Referential Memory QA

Jingbiao Mei, Jinghong Chen, Guangyu Yang, Xinyu Hou, Margaret Li, Bill Byrne

TL;DR

This work introduces ATM-Bench, the first benchmark for multimodal, multi-source personalized referential Memory QA, and proposes Schema-Guided Memory (SGM) to structurally represent memory items originating from different sources.

Abstract

Personalized AI assistants must recall and reason over long-term user memory, which naturally spans multiple modalities and sources such as images, videos, and emails. However, existing long-term memory benchmarks focus primarily on dialogue history and fail to capture realistic personalized references grounded in lived experience. We introduce ATM-Bench, the first benchmark for multimodal, multi-source personalized referential Memory QA. ATM-Bench contains approximately four years of privacy-preserving personal memory data and human-annotated question-answer pairs with ground-truth memory evidence, including queries that require resolving personal references, reasoning over multiple pieces of evidence from multiple sources, and handling conflicting evidence. We propose Schema-Guided Memory (SGM) to structurally represent memory items originating from different sources. In experiments, we implement five state-of-the-art memory systems along with a standard RAG baseline and evaluate variants with different memory ingestion, retrieval, and answer generation techniques. We find poor performance (under 20% accuracy) on the ATM-Bench-Hard set, and that SGM improves performance over the Descriptive Memory commonly adopted in prior work. Code is available at: https://github.com/JingbiaoMei/ATM-Bench

Paper Structure (92 sections, 12 equations, 7 figures, 14 tables)


Figures (7)

  • Figure 1: Core challenges in Long-term Multi-source Multimodal Personalized Referential Memory QA. (A) Resolving Personal References. Given a multimodal personal database, the agent needs to resolve the user's reference ("Grace") by linking to the user's past memory, and subsequently retrieve moments matching the query's condition ("being sneaky"). (B) Conflict-aware Multi-source Aggregation with Memory Updates Over Time. To compute total hotel expenses, the model aggregates heterogeneous evidence (emails and receipts), resolves conflicts between booking records and finalized invoices, and prioritizes later-updated memory entries that reflect the most recent and finalized information. This evaluates the agent's ability to handle memory updates over time. (C) Temporal–Location–Visual Grounding. When images lack precise location cues, the model infers an event time window from an email memory item, retrieves images within that interval, and grounds them to the queried location to recover the ordered set of memory items.
  • Figure 2: Personal Memory Assistant Overview.
  • Figure 3: The UI of the annotation tool.
  • Figure 4: (a) Evidence count. (b) Maximum time span between evidence items for a single QA instance. (c) Recall distance.
  • Figure 5: Geo coverage.
  • ...and 2 more figures
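
The conflict-aware aggregation described in Figure 1 (B) can be illustrated with a minimal sketch. It is not the paper's implementation: the `MemoryItem` record, its field names, the helper functions, and the toy amounts and dates are all hypothetical, chosen only to show the idea of preferring the most recently updated record for each expense before summing.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MemoryItem:
    """Hypothetical memory record; field names are illustrative assumptions."""
    source: str            # e.g. "email" (booking) or "receipt" (final invoice)
    entity: str            # what the record is about, e.g. one hotel stay
    amount: float          # amount stated in this record
    updated_at: datetime   # when this record was created or last updated


def latest_per_entity(items):
    """Keep only the most recently updated record for each entity."""
    latest = {}
    for item in items:
        current = latest.get(item.entity)
        if current is None or item.updated_at > current.updated_at:
            latest[item.entity] = item
    return latest


def total_expense(items):
    """Sum amounts after resolving conflicts in favour of later records."""
    return sum(item.amount for item in latest_per_entity(items).values())


# Toy data: the later receipt for hotel_a supersedes the earlier booking email.
items = [
    MemoryItem("email", "hotel_a", 200.0, datetime(2023, 5, 1)),
    MemoryItem("receipt", "hotel_a", 230.0, datetime(2023, 5, 4)),
    MemoryItem("email", "hotel_b", 150.0, datetime(2023, 6, 10)),
]
total = total_expense(items)
print(total)
```

A real system would also have to decide which records refer to the same stay, which this sketch sidesteps by assuming a shared `entity` key.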