Multimodal Contextual Dialogue Breakdown Detection for Conversational AI Models
Md Messal Monem Miah, Ulie Schnaithmann, Arushi Raghuvanshi, Youngseo Son
TL;DR
This work tackles real-time dialogue breakdown detection in healthcare-facing conversational AI by introducing MultConDB, a multimodal contextual framework that fuses audio signals with transcribed text. The authors compare four architectures (Text LSTM, End-to-End RoBERTa, MulT A+T, and MultConDB) and find that MultConDB's combination of unimodal (text, audio) and multimodal encoders with a 5-turn context window performs best, achieving a test F1 of $69.27$ and generalizing well to unseen data with $F1=71.22$, precision $65.77$, and recall $77.66$. Qualitative analyses (e.g., t-SNE visualizations) indicate that contextual multimodal cues better separate breakdown from non-breakdown cases and can implicitly capture breakdown causes such as AI silence, interruptions, and skipped actions. The results underscore the practical value of multimodal contextual reasoning for reliable, real-time intervention in healthcare dialogue systems, while acknowledging data privacy constraints and the need for locally hosted deployments with low latency ($0.06$s per utterance) for online use.
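As a quick sanity check on the unseen-data figures quoted above, F1 is the harmonic mean of precision and recall, so the three reported numbers should agree with one another. The sketch below is only an illustrative check and is not taken from the authors' code; the function name `f1_score` and the variable names are assumptions introduced here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Values reported for the unseen data, in percentage points.
reported_precision = 65.77
reported_recall = 77.66

# Hypothetical check: recompute F1 from the reported precision and recall.
print(round(f1_score(reported_precision, reported_recall), 2))  # prints 71.22
```

The recomputed value matches the reported $F1=71.22$, so the precision, recall, and F1 figures are mutually consistent.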
Abstract
Detecting dialogue breakdown in real time is critical for conversational AI systems because it enables corrective action so that a task can still be completed successfully. In spoken dialogue systems, breakdown can be caused by a variety of unexpected situations, including high levels of background noise that cause speech-to-text (STT) mistranscriptions, or unexpected user flows. Industry settings such as healthcare, in particular, require high precision and high flexibility to navigate differently based on the conversation history and dialogue states. This makes accurate dialogue breakdown detection both more challenging and more critical. We found that accurate detection requires processing audio inputs along with downstream NLP model inferences on transcribed text in real time. In this paper, we introduce a Multimodal Contextual Dialogue Breakdown (MultConDB) model. This model significantly outperforms the best previously known models, achieving an F1 of 69.27.
