CarMem: Enhancing Long-Term Memory in LLM Voice Assistants through Category-Bounding

Johannes Kirmayr; Lukas Stappen; Phillip Schneider; Florian Matthes; Elisabeth André

TL;DR

CarMem introduces a category-bound long-term memory architecture for LLM-based voice assistants that improves personalization while respecting privacy. It comprises three modules (Extraction, Maintenance, and Retrieval) implemented via LLM function calling, which outputs structured memories; relevant memories are retrieved via embedding similarity. The authors evaluate the system on CarMem, a synthetic in-car dataset with 1,000 Extraction Conversations, 1,000 Retrieval Utterances, and 3,000 Maintenance Utterances. It achieves extraction F1 scores from 0.78 to 0.95, reduces redundant memories by up to 95 percent and contradictory ones by up to 92 percent, and reaches a retrieval accuracy of 0.87. The work demonstrates industrial relevance and the transparency benefits of category-based storage, notes limitations such as dataset scope and potential LLM biases, and outlines future extensions to other domains.
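
As an illustration of retrieval by embedding similarity, the following is a minimal sketch that assumes cosine similarity as the ranking measure. The toy 3-dimensional vectors, the `retrieve` helper, and the stored preference texts are hypothetical and are not taken from the paper; a real system would obtain the vectors from an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, memory, top_k=1):
    """Return the top_k stored preference texts ranked by cosine similarity."""
    ranked = sorted(memory, key=lambda m: cosine(query_vec, m["vec"]), reverse=True)
    return [m["text"] for m in ranked[:top_k]]

# Hypothetical stored preferences with toy embeddings (assumption, not paper data).
memory = [
    {"text": "prefers jazz radio stations", "vec": [0.9, 0.1, 0.0]},
    {"text": "prefers Italian restaurants", "vec": [0.0, 0.2, 0.9]},
]

# Hypothetical embedding of an utterance such as "play some music".
query = [0.8, 0.2, 0.1]
top = retrieve(query, memory)
print(top)
```

With these toy vectors the radio preference ranks first, because its vector points in nearly the same direction as the query.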

Abstract

In today's assistant landscape, personalisation enhances interactions, fosters long-term relationships, and deepens engagement. However, many systems struggle with retaining user preferences, leading to repetitive user requests and disengagement. Furthermore, the unregulated and opaque extraction of user preferences in industry applications raises significant concerns about privacy and trust, especially in regions with stringent regulations like Europe. In response to these challenges, we propose a long-term memory system for voice assistants, structured around predefined categories. This approach leverages Large Language Models to efficiently extract, store, and retrieve preferences within these categories, ensuring both personalisation and transparency. We also introduce a synthetic multi-turn, multi-session conversation dataset (CarMem), grounded in real industry data, tailored to an in-car voice assistant setting. Benchmarked on the dataset, our system achieves an F1-score of .78 to .95 in preference extraction, depending on category granularity. Our maintenance strategy reduces redundant preferences by 95% and contradictory ones by 92%, while the accuracy of optimal retrieval is at .87. Collectively, the results demonstrate the system's suitability for industrial applications.
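
To make the reported extraction metric concrete, here is a small sketch of a multi-label F1 computation. It assumes micro-averaging over sets of predicted categories, which may differ from the averaging the authors use; the category names and the `micro_f1` helper are hypothetical.

```python
def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over per-conversation sets of category labels."""
    tp = fp = fn = 0
    for true, pred in zip(true_sets, pred_sets):
        tp += len(true & pred)   # categories correctly extracted
        fp += len(pred - true)   # categories extracted but not annotated
        fn += len(true - pred)   # annotated categories that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical ground truth and predictions for three conversations.
true_sets = [{"radio_station"}, {"cuisine"}, set()]
pred_sets = [{"radio_station"}, {"cuisine", "music_genre"}, set()]

score = micro_f1(true_sets, pred_sets)
print(round(score, 2))
```

Here two categories are extracted correctly and one is spurious, so precision is 2/3, recall is 1, and the F1 score is 0.8.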

Paper Structure (29 sections, 6 figures, 7 tables)

Introduction
Related Work
Structured and Category-Bound User-Preference-Memory
Preference Extraction
Preference Maintenance
Preference Retrieval
Data
Experiments
Preference Extraction
Experiment Setting
Preference Maintenance
Experiment Setting
Preference Retrieval
Experiment Setting
Conclusion
...and 14 more sections

Figures (6)

Figure 1: High-level memory flow: After a conversation, preferences are extracted (1) based on the predefined category schema (e.g. preferred radio station). Topics outside the category schema, such as favourite movies, are not extracted. (2) Before inserting a new preference, it is compared to existing preferences for consistency, applying the most suitable maintenance operation: append, pass, or update. Within the next conversation (3), the voice assistant retrieves semantically relevant preferences (3a) from the user storage (3b) to provide a personalized response.
Figure 2: Representative subset of the hierarchically predefined preference categories. There are two types of detail categories: MP (yellow), where multiple preferences within the category are possible, and SP (orange), where only a single preference within the category is allowed. A full list of categories with attributes is provided in Appendix \ref{subsec:full_list_categories}.
Figure 3: Example data point of the synthetically generated CarMem dataset showing the three different parts.
Figure 4: Diversity evaluation (Distinct-1, Distinct-2, Distinct-3; y-axis) with dynamic and fixed inputs. The scores were calculated for four different user preferences and then averaged, with each preference's conversations being regenerated 1 to 10 times (x-axis).
Figure 5: Multi-label confusion matrix \cite{MultiLabelConfusionMatrix}, normalized across the rows, on the detail category level for the In-Schema experiments (refer to Section \ref{subsec:exp_pref_extraction}). The last row represents data points with no true label (NTL), while the last column represents data points with no predicted label (NPL).
...and 1 more figure
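
The captions of Figures 1 and 2 describe category-bound storage with MP and SP categories and the maintenance operations append, pass, and update. The sketch below is a toy stand-in for that flow. It assumes exact string matching in place of the LLM decision used in the paper, and it simplifies update to the SP case only. The `SCHEMA` contents, the `Memory` class, and the extra `"skip"` result for out-of-schema topics are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical schema: category name -> "MP" (multiple preferences allowed)
# or "SP" (single preference only). Not the paper's actual category list.
SCHEMA = {"favourite_radio_station": "MP", "seat_heating_level": "SP"}

@dataclass
class Memory:
    # Maps a category name to the list of stored preference strings.
    store: dict = field(default_factory=dict)

    def maintain(self, category, value):
        """Apply a maintenance operation and return its name."""
        if category not in SCHEMA:
            return "skip"            # out-of-schema topics are never stored
        existing = self.store.setdefault(category, [])
        if value in existing:
            return "pass"            # already known, nothing to do
        if SCHEMA[category] == "SP" and existing:
            existing[:] = [value]    # single-preference slot is overwritten
            return "update"
        existing.append(value)       # new preference in this category
        return "append"

mem = Memory()
ops = [
    mem.maintain("favourite_radio_station", "Jazz FM"),
    mem.maintain("favourite_radio_station", "Jazz FM"),
    mem.maintain("favourite_radio_station", "Rock Radio"),
    mem.maintain("seat_heating_level", "low"),
    mem.maintain("seat_heating_level", "high"),
    mem.maintain("favourite_movie", "Inception"),
]
print(ops)
```

In this toy version the category type alone decides between append and update; in the described system, an LLM compares the incoming preference with the existing ones to choose the operation.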
