Does Your Voice Assistant Remember? Analyzing Conversational Context Recall and Utilization in Voice Interaction Models
Heeseung Kim, Che Hyun Lee, Sangkwon Park, Jiheum Yeom, Nohil Park, Sangwon Yu, Sungroh Yoon
TL;DR
The paper investigates whether open-source voice interaction models can maintain and utilize conversational history in multi-turn dialogs. It introduces ContextDialog, a speech-to-speech benchmark derived from MultiDialog that explicitly tests recall of past utterances and the use of retrieved past information. Across experiments with several open-source models, the study finds a sizable gap compared with text-based systems, with recall particularly weak for user utterances. Retrieval-augmented generation brings only limited benefits, owing to retrieval noise and generation biases. The results underscore the need for improved long-context modeling and robust memory or retrieval mechanisms to make open-source voice assistants viable in real-world, multi-turn scenarios. The work also provides a benchmark and analysis framework to guide future research into memory retention and utilization in spoken-dialog systems.
Abstract
Recent advancements in multi-turn voice interaction models have improved user-model communication. However, while closed-source models effectively retain and recall past utterances, whether open-source models share this ability remains unexplored. To fill this gap, we systematically evaluate how well open-source interaction models utilize past utterances using ContextDialog, a benchmark we propose for this purpose. Our findings show that speech-based models have more difficulty than text-based ones, especially when recalling information conveyed in speech, and that even with retrieval-augmented generation, models still struggle with questions about past utterances. These insights highlight key limitations in open-source models and suggest ways to improve memory retention and retrieval robustness.
