M3TCM: Multi-modal Multi-task Context Model for Utterance Classification in Motivational Interviews

Sayed Muddashir Hossain, Jan Alexandersson, Philipp Müller

TL;DR

Utterance classification in motivational interviews must account for distinct therapist and client roles, emotional content, and conversational context. The authors introduce M3TCM, a multi-modal, multi-task framework that encodes text with RoBERTa and audio with AST, models context with a shared self-attention layer, and uses separate classifiers for therapist and client utterances. On the AnnoMI dataset, M3TCM reaches a client F1 of 0.66 and a therapist F1 of 0.83, compared with 0.55 and 0.72 for prior work. Ablation studies quantify the contributions of conversation context, modality fusion, and the shared context model. A context window of about 10 utterances yields the best performance, and results in an online evaluation are consistent with the offline results. The authors suggest extending the model to the video modality and applying it to other asymmetric conversational domains.

Abstract

Accurate utterance classification in motivational interviews is crucial to automatically understand the quality and dynamics of client-therapist interaction, and it can serve as a key input for systems mediating such interactions. Motivational interviews exhibit three important characteristics. First, there are two distinct roles, namely client and therapist. Second, they are often highly emotionally charged, which can be expressed both in text and in prosody. Finally, context is of central importance when classifying any given utterance. Previous works did not adequately incorporate all of these characteristics into utterance classification approaches for mental health dialogues. In contrast, we present M3TCM, a Multi-modal, Multi-task Context Model for utterance classification. Our approach is the first to employ multi-task learning to effectively model both joint and individual components of therapist and client behaviour. Furthermore, M3TCM integrates information from the text and speech modalities as well as the conversation context. With our novel approach, we outperform the state of the art for utterance classification on the recently introduced AnnoMI dataset, with a relative improvement of 20% for client and 15% for therapist utterance classification. In extensive ablation studies, we quantify the improvement resulting from each contribution.
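The relative improvements quoted in the abstract can be checked against the F1 scores given in the TL;DR. The following is a minimal sketch of that arithmetic; the function name `relative_improvement` is hypothetical and not taken from the paper.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Return the improvement of `new` over `baseline` as a fraction of `baseline`."""
    return (new - baseline) / baseline


# F1 scores as reported in the TL;DR: M3TCM versus prior work on AnnoMI.
client = relative_improvement(0.66, 0.55)
therapist = relative_improvement(0.83, 0.72)

print(f"client: {client:.1%}, therapist: {therapist:.1%}")
```

Rounded to whole percentages, these values agree with the 20% and 15% stated in the abstract.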

Paper Structure

This paper contains 15 sections, 3 equations, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Overview of the M3TCM model. Several consecutive therapist and client utterances ($u_{ti}$ and $u_{ci}$, respectively) are encoded using RoBERTa and AST models, producing text and audio embeddings. A shared self-attention layer models conversation context across utterances. Finally, separate classification networks produce predictions for therapist and client utterances.
  • Figure 2: Performance for therapist and client utterance classification for different context sizes.
  • Figure 3: Performance of therapist and client utterance classification for different context sizes in an online evaluation scenario.
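
Figures 2 and 3 vary the number of context utterances available to the model. The sketch below illustrates one way such context windows could be built. It rests on assumptions that the text above does not spell out: that the offline setting may use utterances on both sides of the target, and that the online setting may use only the target and the utterances preceding it. The function names `offline_window` and `online_window` are hypothetical.

```python
from typing import List, Sequence


def offline_window(utterances: Sequence[str], index: int, size: int) -> List[str]:
    """Up to `size` utterances around position `index` (assumed offline setting)."""
    before = (size - 1) // 2
    start = max(0, index - before)
    end = min(len(utterances), start + size)
    # Near the end of a session, extend the window backwards to keep its size.
    start = max(0, end - size)
    return list(utterances[start:end])


def online_window(utterances: Sequence[str], index: int, size: int) -> List[str]:
    """Up to `size` utterances ending at position `index` (assumed online setting)."""
    start = max(0, index - size + 1)
    return list(utterances[start:index + 1])


session = [f"u{i}" for i in range(20)]

# A window of 10 utterances, the size reported as performing best.
print(offline_window(session, 12, 10))
print(online_window(session, 12, 10))
```

Under these assumptions, the online window always ends at the utterance being classified, so early utterances in a session receive a shorter context than later ones.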