Multimodal Contextual Dialogue Breakdown Detection for Conversational AI Models

Md Messal Monem Miah; Ulie Schnaithmann; Arushi Raghuvanshi; Youngseo Son

TL;DR

This work tackles real-time dialogue breakdown detection in healthcare-facing conversational AI by introducing MultConDB, a multimodal contextual framework that fuses audio signals with transcribed text. The authors compare four architectures (Text LSTM, End-to-End RoBERTa, MulT A+T, and MultConDB) and find that MultConDB’s combination of unimodal (text, audio) and multimodal encoders with a 5-turn context window performs best, achieving a test F1 of $69.27$. On unseen data the model generalizes well, with $F1=71.22$, precision $65.77$, and recall $77.66$. Qualitative analyses (e.g., t-SNE visualizations) indicate that contextual multimodal cues separate breakdown from non-breakdown cases more cleanly and can implicitly capture breakdown causes such as AI silence, interruptions, and skipped actions. The results underscore the practical value of multimodal contextual reasoning for reliable, real-time intervention in healthcare dialogue systems, while acknowledging data privacy constraints and the need for locally hosted deployments with low latency ($0.06$s per utterance) for online use.
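As a quick sanity check on the generalization figures quoted above, F1 is the harmonic mean of precision and recall, so the three numbers should agree with one another. The sketch below recomputes F1 from the quoted precision and recall; the function name `f1_score` and the variable names are illustrative choices, not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the TL;DR for the unseen-data evaluation (percent scale).
reported_precision = 65.77
reported_recall = 77.66
reported_f1 = 71.22

recomputed = f1_score(reported_precision, reported_recall)
print(round(recomputed, 2))  # 71.22, matching the reported F1
```

The recomputed value matches the reported F1 to two decimal places, so the quoted precision, recall, and F1 are mutually consistent.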

Abstract

Detecting dialogue breakdown in real time is critical for conversational AI systems because it enables corrective action so that a task can still be completed successfully. In spoken dialogue systems, breakdown can be caused by a variety of unexpected situations, including high levels of background noise that cause STT mistranscriptions, or unexpected user flows. Industry settings such as healthcare, in particular, require high precision and high flexibility to navigate differently based on the conversation history and dialogue states. This makes accurate detection of dialogue breakdown both more challenging and more critical. We found that accurate detection requires processing audio inputs along with downstream NLP model inferences on transcribed text in real time. In this paper, we introduce a Multimodal Contextual Dialogue Breakdown (MultConDB) model. This model significantly outperforms other known best models by achieving an F1 of 69.27.
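To make the role of conversation history concrete, the 5-turn context window mentioned in the TL;DR can be pictured as a sliding window over the turns of a call. The sketch below is only an illustration under stated assumptions: it assumes the window covers the current turn plus up to four preceding turns, and the helper `context_windows` and the example turns are hypothetical, not the authors' preprocessing code.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def context_windows(turns: Sequence[T], window: int = 5) -> List[List[T]]:
    """For each turn, return that turn plus up to `window - 1` preceding turns.

    Assumption: the window includes the current turn. Early turns simply get
    a shorter history rather than padding.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        list(turns[max(0, i - window + 1): i + 1])
        for i in range(len(turns))
    ]


# Hypothetical (speaker, utterance) turns, invented for illustration only.
call = [
    ("AI", "Hello, how can I help?"),
    ("User", "I need to check an appointment."),
    ("AI", "Sure, what is the date?"),
    ("User", "[loud noise]"),
    ("AI", "Sorry, could you repeat that?"),
    ("User", "Next Tuesday."),
]

windows = context_windows(call, window=5)
print(len(windows[-1]))  # 5 turns of context for the final turn
```

A classifier would then score each window to decide whether its last turn is a breakdown; how the actual model pads or truncates short histories is not stated in this summary.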

Paper Structure (27 sections, 10 figures, 3 tables)

This paper contains 27 sections, 10 figures, 3 tables.

Introduction
Related Work
Method
Data
Models
Text LSTM.
End-to-End LLM Classifier.
Multimodal Transformer (MulT A+T).
MultConDB.
Results and Analysis
Task Evaluation
MultConDB Qualitative Analysis
Dialogue Breakdown Detection Model Generalizability Testing
Conclusion
Phone Call Dialogue Breakdown Examples
...and 12 more sections

Figures (10)

Figure 1: Example of dialogue breakdown in a phone call conversation caused by loud noise from user audio. See more examples in the section "Phone Call Dialogue Breakdown Examples".
Figure 2: MultConDB model architecture.
Figure 3: Number of turns between dialogue breakdown ground truth and first model predictions.
Figure 4: Breakdown and non-breakdown turns of users and our conversational AI model, as captured by our model's output layer. The "Before" panel shows a 2D t-SNE of our model's input embedding (the concatenation of speaker tag, utterance, AI agent intent, and audio), and the "After" panel shows the last output layer of our model, right before the softmax layer of the prediction head.
Figure 5: 2D t-SNE of MultConDB output layers colored by types of dialogue breakdown.
...and 5 more figures
