Empathy Level Prediction in Multi-Modal Scenario with Supervisory Documentation Assistance

Yufei Xiao, Shangfei Wang

TL;DR

This paper tackles empathy level prediction by extending beyond text to a fully multi-modal framework that also leverages supervisory documents as privileged information during training. It introduces a two-part architecture: a Multi-Modal Empathy Prediction Network for fused text, audio, and video features, and a Supervisory Documentation Assisted Training module that uses LDA-derived topic distributions to supervise text representations. Experiments on the MEDIC dataset show superior accuracy and F1 scores, with cross-dataset validation on the Mental Health Subreddits dataset demonstrating generalization. The approach offers a practical pathway to richer empathy understanding in counseling by integrating privileged supervisory insights without needing them at inference time.

Abstract

Prevalent empathy prediction techniques primarily concentrate on a singular modality, typically textual, thus neglecting multi-modal processing capabilities. They also overlook the utilization of certain privileged information, which may encompass additional empathetic content. In response, we introduce an advanced multi-modal empathy prediction method integrating video, audio, and text information. The method comprises the Multi-Modal Empathy Prediction and Supervisory Documentation Assisted Training. We use pre-trained networks in the empathy prediction network to extract features from various modalities, followed by a cross-modal fusion. This process yields a multi-modal feature representation, which is employed to predict empathy labels. To enhance the extraction of text features, we incorporate supervisory documents as privileged information during the assisted training phase. Specifically, we apply the Latent Dirichlet Allocation model to identify potential topic distributions to constrain text features. These supervisory documents, created by supervisors, focus on the counseling topics and the counselor's display of empathy. Notably, this privileged information is only available during training and is not accessible during the prediction phase. Experimental results on the multi-modal and dialogue empathy datasets demonstrate that our approach is superior to the existing methods.
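The abstract says that LDA topic distributions from supervisory documents constrain the text features during training and are unavailable at prediction time. The sketch below illustrates that idea only; it is not the authors' implementation. This excerpt does not give the form of the constraint, so the linear projection onto topics, the KL-divergence penalty, the weight `lam`, and all function names are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(feature, weights):
    """Map a text feature vector to one score per topic (hypothetical linear head)."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); terms with zero target mass contribute nothing."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )

def topic_constraint_loss(text_feature, topic_weights, lda_topics):
    """Penalty for text features whose predicted topics differ from the LDA topics."""
    predicted = softmax(project(text_feature, topic_weights))
    return kl_divergence(lda_topics, predicted)

def training_loss(empathy_loss, text_feature, topic_weights, lda_topics=None, lam=0.5):
    """Add the topic penalty only when a supervisory document is available.

    lda_topics is the privileged information: pass None at prediction time,
    when no supervisory document exists. lam is an assumed trade-off weight.
    """
    if lda_topics is None:
        return empathy_loss
    return empathy_loss + lam * topic_constraint_loss(text_feature, topic_weights, lda_topics)

# Toy usage: a 2-dimensional text feature and 3 topics.
feature = [0.0, 0.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
with_document = training_loss(0.7, feature, weights, lda_topics=[1.0, 0.0, 0.0])
without_document = training_loss(0.7, feature, weights)
```

Because the penalty appears only in the training loss, the prediction path never needs the supervisory document, which matches the privileged-information setting the abstract describes.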

Paper Structure

This paper contains 26 sections, 10 equations, 2 figures, 10 tables, 1 algorithm.

Figures (2)

  • Figure 1: An example of a supervisory document. The left side shows the conversation in the counseling session, and the right side shows the corresponding supervisory document.
  • Figure 2: The framework of the Empathy Level Prediction in Multi-Modal Scenario with Supervisory Documentation Assistance method. It consists of two parts: (a) the Multi-Modal Empathy Prediction Network and (b) Supervisory Documentation Assisted Training.