3M-Health: Multimodal Multi-Teacher Knowledge Distillation for Mental Health Detection

Rina Carines Cabral, Siwen Luo, Josiah Poon, Soyeon Caren Han

TL;DR

This work introduces a Multimodal and Multi-Teacher Knowledge Distillation model for Mental Health Classification, leveraging insights from cross-modal human understanding, and addresses the challenge of appropriately representing inputs of varying natures.

Abstract

The significance of mental health classification is paramount in contemporary society, where digital platforms serve as crucial sources for monitoring individuals' well-being. However, existing social media mental health datasets primarily consist of text-only samples, potentially limiting the efficacy of models trained on such data. Recognising that humans utilise cross-modal information to comprehend complex situations or issues, we present a novel approach to address the limitations of current methodologies. In this work, we introduce a Multimodal and Multi-Teacher Knowledge Distillation model for Mental Health Classification, leveraging insights from cross-modal human understanding. Unlike conventional approaches that often rely on simple concatenation to integrate diverse features, our model addresses the challenge of appropriately representing inputs of varying natures (e.g., texts and sounds). To mitigate the computational complexity associated with integrating all features into a single model, we employ a multimodal and multi-teacher architecture. By distributing the learning process across multiple teachers, each specialising in a particular feature extraction aspect, we enhance the overall mental health classification performance. Through experimental validation, we demonstrate the efficacy of our model in achieving improved performance.
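
The abstract describes distributing learning across several teachers but does not give the training objective. The following is a minimal sketch of one common way to combine multiple teachers in knowledge distillation, not the authors' formulation: the student's temperature-softened predictions are pulled towards each teacher's via a KL term, the terms are averaged, and the result is mixed with a hard-label cross-entropy. The function names, the equal weighting of teachers, and the `temperature` and `alpha` parameters are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def multi_teacher_kd_loss(student_logits, teacher_logits_list, label,
                          temperature=2.0, alpha=0.5):
    """Hypothetical multi-teacher distillation loss for one sample.

    teacher_logits_list holds one logit vector per teacher (for example,
    a text-based teacher and an audio-based teacher). Equal teacher
    weighting is an assumption, not something stated in the paper.
    """
    # Hard-label cross-entropy on the student's unsoftened prediction.
    student_probs = softmax(student_logits)
    cross_entropy = -math.log(student_probs[label] + 1e-12)

    # Soft-label term: average KL from each teacher to the student.
    student_soft = softmax(student_logits, temperature)
    kd_terms = [
        kl_divergence(softmax(teacher_logits, temperature), student_soft)
        for teacher_logits in teacher_logits_list
    ]
    distillation = sum(kd_terms) / len(kd_terms)

    # The temperature**2 factor is the usual rescaling of the soft term.
    return alpha * cross_entropy + (1 - alpha) * temperature ** 2 * distillation


if __name__ == "__main__":
    student = [0.2, 1.1, -0.3]
    teachers = [[0.0, 2.0, -1.0], [0.5, 1.5, -0.5]]
    print(multi_teacher_kd_loss(student, teachers, label=1))
```

Averaging keeps every teacher's contribution on the same scale; a weighted sum would be a natural variation if some modalities were known to be more reliable than others.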

Paper Structure (24 sections, 6 figures, 10 tables)

Figures (6)

  • Figure 1: Architecture of 3M-Health: Multimodal Multi-teacher Knowledge Distillation for Mental Health Detection.
  • Figure 2: Class distribution. For (a) TwitSuicide, SI: Safe to Ignore; PC: Possibly Concerning; SC: Strongly Concerning. For (b) DEPTWEET, ND: Non-depression; MI: Mild; MO: Moderate; SE: Severe. For (c) IdenDep, NDE: Non-depression; DE: Depression. For (d) SDCNL, DEP: Depression; SUI: Suicide.
  • Figure 3: Audio length comparison. ch: character average
  • Figure 4: Audio analysis using PCA on spectrogram images of audio samples grouped by a maximum of 10s (left) and 10-25s (right). Each sample is labelled with an ID for reference to corresponding texts provided in the Supplementary Material.
  • Figure 5: Distribution of multi-label emotion class labels.
  • ...and 1 more figures