CarMem: Enhancing Long-Term Memory in LLM Voice Assistants through Category-Bounding
Johannes Kirmayr, Lukas Stappen, Phillip Schneider, Florian Matthes, Elisabeth André
TL;DR
CarMem introduces a category-bound long-term memory architecture for LLM-enabled voice assistants that improves personalization while respecting privacy. The system has three modules (Extraction, Maintenance, and Retrieval) that use LLM function calling to output structured memories and embedding similarity to retrieve relevant ones. The authors validate it on the CarMem dataset, a synthetic in-car dataset with $1{,}000$ Extraction Conversations, $1{,}000$ Retrieval Utterances, and $3{,}000$ Maintenance Utterances. Extraction reaches $F_1$ scores of $0.78$ to $0.95$ depending on category granularity, maintenance reduces redundant memories by $95$ percent and contradictory ones by $92$ percent, and retrieval accuracy is $0.87$. The work demonstrates industrial relevance and the transparency benefits of category-based storage, notes limitations such as dataset scope and potential LLM biases, and outlines future extensions to other domains.
Abstract
In today's assistant landscape, personalisation enhances interactions, fosters long-term relationships, and deepens engagement. However, many systems struggle with retaining user preferences, leading to repetitive user requests and disengagement. Furthermore, the unregulated and opaque extraction of user preferences in industry applications raises significant concerns about privacy and trust, especially in regions with stringent regulations like Europe. In response to these challenges, we propose a long-term memory system for voice assistants, structured around predefined categories. This approach leverages Large Language Models to efficiently extract, store, and retrieve preferences within these categories, ensuring both personalisation and transparency. We also introduce a synthetic multi-turn, multi-session conversation dataset (CarMem), grounded in real industry data, tailored to an in-car voice assistant setting. Benchmarked on the dataset, our system achieves an F1-score of .78 to .95 in preference extraction, depending on category granularity. Our maintenance strategy reduces redundant preferences by 95% and contradictory ones by 92%, while the accuracy of optimal retrieval is at .87. Collectively, the results demonstrate the system's suitability for industrial applications.
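The category-bound idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the category names, the `MemoryStore` class, the `pass`/`update`/`append` return labels, and the `exclusive` flag are all hypothetical, and a toy bag-of-words vector stands in for a real embedding model. The extraction step (an LLM emitting a structured category and preference) is assumed to have already happened.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical category schema; the actual CarMem categories are not listed here.
ALLOWED_CATEGORIES = {"navigation.avoid", "climate.temperature", "music.genre"}


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class MemoryStore:
    memories: dict = field(default_factory=dict)  # category -> list of preferences

    def maintain(self, category, preference, exclusive=False):
        """Store an already-extracted preference, bounded by the category schema."""
        if category not in ALLOWED_CATEGORIES:
            return "rejected"  # nothing outside the predefined categories is kept
        existing = self.memories.setdefault(category, [])
        if preference in existing:
            return "pass"  # redundant: the preference is already stored
        if exclusive and existing:
            existing[:] = [preference]  # contradiction: replace the old value
            return "update"
        existing.append(preference)
        return "append"

    def retrieve(self, utterance, top_k=1):
        """Rank stored memories by similarity to the user utterance."""
        query = embed(utterance)
        scored = [
            (cosine(query, embed(f"{cat} {pref}")), cat, pref)
            for cat, prefs in self.memories.items()
            for pref in prefs
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(cat, pref) for _, cat, pref in scored[:top_k]]
```

In this sketch, storing "jazz" under `music.genre` twice yields `append` and then `pass`, while a second value for an exclusive category such as `climate.temperature` overwrites the first. A preference in a category outside the schema is never stored, which is the transparency property the category bound is meant to provide.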
