Empathy Level Prediction in Multi-Modal Scenario with Supervisory Documentation Assistance
Yufei Xiao, Shangfei Wang
TL;DR
This paper tackles empathy level prediction by extending beyond text to a multi-modal framework that also uses supervisory documents as privileged information during training. It introduces a two-part architecture: a Multi-Modal Empathy Prediction Network that fuses text, audio, and video features, and a Supervisory Documentation Assisted Training module that uses LDA-derived topic distributions to supervise text representations. Experiments on the MEDIC dataset show higher accuracy and F1 scores than existing methods, and cross-dataset validation on the Mental Health Subreddits dataset demonstrates generalization. The approach offers a practical path to richer empathy understanding in counseling, since the privileged supervisory insights are needed only during training and not at inference time.
Abstract
Prevalent empathy prediction techniques concentrate primarily on a single modality, typically text, and thus neglect the information carried by other modalities. They also overlook privileged information, which may contain additional empathy-related content. In response, we introduce a multi-modal empathy prediction method that integrates video, audio, and text information. The method comprises two components: Multi-Modal Empathy Prediction and Supervisory Documentation Assisted Training. In the empathy prediction network, pre-trained networks extract features from each modality, and these features are then combined through cross-modal fusion. This process yields a multi-modal feature representation, which is used to predict empathy labels. To improve the extraction of text features, we incorporate supervisory documents as privileged information during the assisted training phase. Specifically, we apply the Latent Dirichlet Allocation (LDA) model to infer latent topic distributions, which are used to constrain the text features. These supervisory documents, written by supervisors, focus on the counseling topics and the counselor's display of empathy. Notably, this privileged information is available only during training and is not accessible during the prediction phase. Experimental results on multi-modal and dialogue empathy datasets show that our approach outperforms existing methods.
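The role of privileged information can be illustrated with a minimal sketch. The abstract does not say how the topic distributions constrain the text features, so the code below assumes one plausible form: a KL-divergence penalty between the LDA topic distribution of the supervisory document and a topic distribution predicted from the text features. All function names, the topic head, and the `weight` parameter are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-likelihood of the true empathy label."""
    return -math.log(probs[label])

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def training_loss(empathy_logits, label, topic_logits, doc_topics, weight=0.5):
    """Assumed training objective: label loss plus a topic-matching penalty.

    doc_topics is the LDA topic distribution of the supervisory document
    (privileged information); topic_logits come from a hypothetical topic
    head on top of the text features.
    """
    label_loss = cross_entropy(softmax(empathy_logits), label)
    topic_loss = kl_divergence(doc_topics, softmax(topic_logits))
    return label_loss + weight * topic_loss

def predict(empathy_logits):
    """Inference uses only the empathy head; no supervisory document is needed."""
    probs = softmax(empathy_logits)
    return max(range(len(probs)), key=probs.__getitem__)
```

In this sketch the supervisory document influences only `training_loss`. `predict` never receives `doc_topics`, which mirrors the statement that the privileged information is unavailable during the prediction phase.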
