PerLTQA: A Personal Long-Term Memory Dataset for Memory Classification, Retrieval, and Synthesis in Question Answering
Yiming Du, Hongru Wang, Zhengyi Zhao, Bin Liang, Baojun Wang, Wanjun Zhong, Zezhong Wang, Kam-Fai Wong
TL;DR
PerLTQA tackles the challenge of incorporating personal long-term memory into QA by introducing a dataset that fuses semantic memories (profiles, social ties) with episodic memories (events, dialogues) and a three-stage memory integration framework (classification, retrieval, synthesis). The approach is evaluated across five LLMs and three retrievers. The results show that BERT-based memory classification outperforms several LLMs at predicting memory type, and that effective memory retrieval is critical for accurate, memory-grounded responses. Reported scores reach a retrieval MAP of up to 0.756 and a response correctness of up to 0.573, suggesting that memory-augmented QA is practical even with smaller models. The dataset provides memory-anchored QA content (141 profiles, 1,339 relationships, 4,501 events, 3,409 dialogues, 8,593 QA pairs) together with an evaluation protocol, offering a benchmark for personalized, memory-aware NLP systems and future memory-integrated dialogue agents.
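For readers unfamiliar with the MAP figure quoted above, the following is a minimal sketch of mean average precision for ranked retrieval. It uses the standard textbook definition; the exact variant and cutoff used in the paper are not stated in this summary, and the function names and memory IDs are hypothetical.

```python
from typing import List, Sequence, Set


def average_precision(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Average precision for one query.

    Averages precision@k over every rank k that holds a relevant item,
    normalised by the number of relevant items for the query.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(
    rankings: List[Sequence[str]], relevants: List[Set[str]]
) -> float:
    """Mean of per-query average precision over all queries."""
    scores = [average_precision(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical memory IDs, purely for illustration.
example_rankings = [["m1", "m2", "m3"], ["m4", "m5"]]
example_relevants = [{"m1", "m3"}, {"m5"}]
example_map = mean_average_precision(example_rankings, example_relevants)
```

A retriever that places every relevant memory at the top of its ranking scores 1.0; relevant memories pushed further down the list lower the score.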
Abstract
Long-term memory plays a critical role in personal interaction: a system that considers long-term memory can better leverage world knowledge, historical information, and preferences in dialogues. Our research introduces PerLTQA, an innovative QA dataset that combines semantic and episodic memories, including world knowledge, profiles, social relationships, events, and dialogues. The dataset was collected to investigate the use of personalized memories, focusing on social interactions and events in the QA task. PerLTQA features two types of memory and a comprehensive benchmark of 8,593 questions for 30 characters, facilitating the exploration and application of personalized memories in Large Language Models (LLMs). Based on PerLTQA, we propose a novel framework for memory integration and generation, consisting of three main components: Memory Classification, Memory Retrieval, and Memory Synthesis. We evaluate this framework using five LLMs and three retrievers. Experimental results demonstrate that BERT-based classification models significantly outperform LLMs such as ChatGLM3 and ChatGPT in the memory classification task. Furthermore, our study highlights the importance of effective memory integration in the QA task.
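The three components named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the keyword-based classifier stands in for a trained model such as BERT, the token-overlap scorer stands in for a real retriever, using the predicted memory type as a hard filter is a simplifying assumption, and the synthesis step only builds the prompt that an LLM would receive. All names and example memories below are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Memory:
    kind: str  # "semantic" (profiles, relationships) or "episodic" (events, dialogues)
    text: str


def tokenize(text: str) -> Set[str]:
    """Lowercase and split into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Assumption: a handful of cue words replaces a trained classifier.
EPISODIC_CUES = {"when", "yesterday", "happened", "said", "talked", "event"}


def classify_memory(question: str) -> str:
    """Stage 1: predict which memory type the question needs."""
    return "episodic" if tokenize(question) & EPISODIC_CUES else "semantic"


def retrieve(question: str, memories: List[Memory], kind: str, top_k: int = 2) -> List[Memory]:
    """Stage 2: rank memories of the predicted type by token overlap with the question."""
    query_tokens = tokenize(question)
    candidates = [m for m in memories if m.kind == kind]
    ranked = sorted(
        candidates,
        key=lambda m: len(query_tokens & tokenize(m.text)),
        reverse=True,
    )
    return ranked[:top_k]


def synthesize_prompt(question: str, retrieved: List[Memory]) -> str:
    """Stage 3: assemble a memory-grounded prompt for an LLM to answer."""
    context = "\n".join(f"- [{m.kind}] {m.text}" for m in retrieved)
    return f"Memories:\n{context}\nQuestion: {question}\nAnswer:"


# Hypothetical memories, not taken from the dataset.
MEMORIES = [
    Memory("semantic", "Alice is a software engineer who lives in Berlin."),
    Memory("semantic", "Alice's close friend is Bob."),
    Memory("episodic", "Alice attended Bob's wedding in June."),
    Memory("episodic", "Alice talked with Carol about moving to Paris."),
]

question = "Who is Alice's close friend?"
kind = classify_memory(question)
top = retrieve(question, MEMORIES, kind, top_k=1)
prompt = synthesize_prompt(question, top)
```

Keeping the stages as separate functions mirrors how the abstract evaluates them: the classifier, the retriever, and the generating LLM can each be swapped out and measured independently.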
