CarMem: Enhancing Long-Term Memory in LLM Voice Assistants through Category-Bounding

Johannes Kirmayr, Lukas Stappen, Phillip Schneider, Florian Matthes, Elisabeth André

TL;DR

CarMem introduces a category-bound long-term memory architecture for LLM-enabled voice assistants that improves personalization while respecting privacy. It uses three modules (Extraction, Maintenance, and Retrieval), implemented via LLM function calling, to output structured memories and to retrieve relevant memories via embedding similarity. The authors validate the system on CarMem, a synthetic in-car dataset with 1,000 Extraction Conversations, 1,000 Retrieval Utterances, and 3,000 Maintenance Utterances. They report extraction F1 scores from 0.78 to 0.95, reductions of up to 95 percent in redundant memories and up to 92 percent in contradictory ones, and a retrieval accuracy of 0.87. The work demonstrates industrial relevance and the transparency benefits of category-based storage, notes limitations such as dataset scope and potential LLM biases, and outlines future extensions to other domains.

Abstract

In today's assistant landscape, personalisation enhances interactions, fosters long-term relationships, and deepens engagement. However, many systems struggle with retaining user preferences, leading to repetitive user requests and disengagement. Furthermore, the unregulated and opaque extraction of user preferences in industry applications raises significant concerns about privacy and trust, especially in regions with stringent regulations like Europe. In response to these challenges, we propose a long-term memory system for voice assistants, structured around predefined categories. This approach leverages Large Language Models to efficiently extract, store, and retrieve preferences within these categories, ensuring both personalisation and transparency. We also introduce a synthetic multi-turn, multi-session conversation dataset (CarMem), grounded in real industry data, tailored to an in-car voice assistant setting. Benchmarked on the dataset, our system achieves an F1-score of .78 to .95 in preference extraction, depending on category granularity. Our maintenance strategy reduces redundant preferences by 95% and contradictory ones by 92%, while the accuracy of optimal retrieval is at .87. Collectively, the results demonstrate the system's suitability for industrial applications.
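The category-bound maintenance step can be illustrated with a minimal sketch. It assumes a tiny hypothetical schema (`radio_station`, `seat_heating_level`) and replaces the LLM-based consistency check with plain string comparison, so it only shows the shape of the three operations named in the Figure 1 caption (append, pass, update) and the MP/SP distinction from Figure 2. It is not the authors' implementation, and the `"rejected"` return value is an illustrative name for out-of-schema input.

```python
from dataclasses import dataclass, field

# Hypothetical schema: category -> "MP" (multiple preferences allowed)
# or "SP" (single preference allowed). Category names are illustrative only.
SCHEMA = {
    "radio_station": "MP",
    "seat_heating_level": "SP",
}


@dataclass
class MemoryStore:
    # Stored preferences per category, normalized to lowercase.
    prefs: dict = field(default_factory=dict)

    def maintain(self, category: str, value: str) -> str:
        """Insert a preference and return the maintenance operation applied."""
        if category not in SCHEMA:
            # Out-of-schema topics (e.g. favourite movies) are never stored.
            return "rejected"
        existing = self.prefs.setdefault(category, [])
        normalized = value.strip().lower()
        if normalized in existing:
            # Assumption: exact match stands in for the LLM's redundancy check.
            return "pass"
        if SCHEMA[category] == "SP" and existing:
            # A single-preference category keeps only the newest value.
            existing[:] = [normalized]
            return "update"
        existing.append(normalized)
        return "append"


store = MemoryStore()
for category, value in [
    ("radio_station", "Jazz FM"),
    ("radio_station", "jazz fm"),
    ("seat_heating_level", "high"),
    ("seat_heating_level", "low"),
    ("favourite_movie", "Alien"),
]:
    print(category, value, "->", store.maintain(category, value))
```

Restricting writes to keys in `SCHEMA` is what keeps the store inspectable: every stored preference maps to a category that was defined in advance.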

Paper Structure (29 sections, 6 figures, 7 tables)


Figures (6)

  • Figure 1: High-level memory flow: After a conversation, preferences are extracted (1) based on the predefined category schema (e.g. preferred radio station). Topics outside the category schema, such as favourite movies, are not extracted. (2) Before inserting a new preference, it is compared to existing preferences for consistency, applying the most suitable maintenance operation: append, pass, or update. Within the next conversation (3), the voice assistant retrieves semantically relevant preferences (3a) from the user storage (3b) to provide a personalized response.
  • Figure 2: Representative subset of the hierarchically predefined preference categories. There are two types of detail categories: MP (yellow), where multiple preferences within the category are possible, and SP (orange), where only a single preference within the category is allowed. A full list of categories with attributes is provided in Appendix \ref{subsec:full_list_categories}.
  • Figure 3: Example data point of the synthetically generated CarMem dataset showing the three different parts.
  • Figure 4: Diversity evaluation (Distinct-1, Distinct-2, Distinct-3; y-axis) with dynamic and fixed inputs. The scores were calculated for four different user preferences and then averaged, with each preference's conversations being regenerated 1 to 10 times (x-axis).
  • Figure 5: Multi-label confusion matrix \cite{MultiLabelConfusionMatrix}, normalized across the rows, on the detail category level for the In-Schema experiments (refer to Section \ref{subsec:exp_pref_extraction}). The last row represents data points with no true label (NTL), while the last column represents data points with no predicted label (NPL).
  • ...and 1 more figure
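
The Distinct-n scores named in the Figure 4 caption follow the usual definition: the number of unique n-grams divided by the total number of n-grams in a set of texts. The sketch below assumes lowercase whitespace tokenization and pools n-grams across all texts; the paper's exact tokenization and aggregation are not given here, so treat both as assumptions.

```python
def distinct_n(texts, n):
    """Ratio of unique n-grams to total n-grams, pooled over all texts."""
    ngrams = []
    for text in texts:
        # Assumption: simple lowercase whitespace tokenization.
        tokens = text.lower().split()
        ngrams.extend(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    # Texts shorter than n tokens contribute no n-grams.
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


conversations = ["play some jazz", "play some rock"]
for n in (1, 2, 3):
    print(f"Distinct-{n}: {distinct_n(conversations, n):.2f}")
```

Higher values mean less repetition across the generated conversations, which is why the metric is tracked as the same preference is regenerated multiple times.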