Does Your Voice Assistant Remember? Analyzing Conversational Context Recall and Utilization in Voice Interaction Models

Heeseung Kim, Che Hyun Lee, Sangkwon Park, Jiheum Yeom, Nohil Park, Sangwon Yu, Sungroh Yoon

TL;DR

The paper investigates whether open-source voice interaction models can maintain and utilize conversational history in multi-turn dialogs. It introduces ContextDialog, a speech-to-speech benchmark derived from MultiDialog to explicitly test recall of past utterances and the use of retrieved past information. Across experiments with several open-source models, the study finds a sizable gap compared with text-based systems, with recall particularly weak for user utterances and limited benefits from retrieval-augmented generation due to retrieval noise and generation biases. The results underscore the need for improved long-context modeling and robust memory or retrieval mechanisms to make open-source voice assistants viable in real-world, multi-turn scenarios. The work also provides a benchmark and analysis framework to guide future research into memory retention and utilization in spoken-dialog systems.

Abstract

Recent advancements in multi-turn voice interaction models have improved user-model communication. However, while closed-source models effectively retain and recall past utterances, whether open-source models share this ability remains unexplored. To fill this gap, we systematically evaluate how well open-source interaction models utilize past utterances using ContextDialog, a benchmark we proposed for this purpose. Our findings show that speech-based models have more difficulty than text-based ones, especially when recalling information conveyed in speech, and even with retrieval-augmented generation, models still struggle with questions about past utterances. These insights highlight key limitations in open-source models and suggest ways to improve memory retention and retrieval robustness.

Paper Structure

This paper contains 26 sections, 15 figures, and 13 tables.

Figures (15)

  • Figure 1: Overview of the ContextDialog generation process. Past-recall QA pairs are first generated and validated (Section \ref{subsec:generation}), then converted to speech via adaptive TTS and verified both automatically and manually (Section \ref{subsec:tts}).
  • Figure 2: Overview of our analyses. In Section \ref{subsec:recall}, we evaluate model recall by analyzing responses to questions about (a) past user and (b) past model utterances. In Section \ref{subsec:retrieval}, we examine whether (c) augmenting spoken response generation with separately retrieved utterances improves responses to questions about past utterances.
  • Figure 3: Attention maps for ground truth answers given each model's past dialog and question. The $x$-axis indicates the order of each speaker's utterances ("U" for user, "M" for model), while the $y$-axis shows the index of the attention layer. In each subfigure, the left side represents questions about past user utterances, and the right side represents questions about past model utterances. Red boxes indicate the positions of supporting utterances.
  • Figure 4: Results of applying the RAG method to each model. The $y$-axis shows GPT Scores on a 5-point scale, with higher scores representing better performance. The red dashed line indicates the results generated without RAG (Section \ref{subsec:recall}). The evaluation is based on the transcribed spoken response ($\mathcal{S} \rightarrow \mathcal{T}, \underline{\bm{\mathcal{S}}}$).
  • Figure 5: Two representative approaches for generating text alongside spoken responses to enhance semantic coherence in voice interaction models.
  • ...and 10 more figures