Bridging Modalities: Knowledge Distillation and Masked Training for Translating Multi-Modal Emotion Recognition to Uni-Modal, Speech-Only Emotion Recognition

Muhammad Muaz, Nathan Paull, Jahnavi Malagavalli

TL;DR

The paper addresses the challenge of translating multimodal emotion recognition models to practical, resource-efficient speech-only systems. It proposes two main approaches: knowledge distillation from a multimodal teacher (COGMEN) to a speech-only student, and masked training that grounds the model using partial modality information. Empirical results show that knowledge distillation can yield speech-only models with comparable performance to their multimodal teachers, while masked training can outperform both audio-only and audio–video baselines and even enable effective learning without a graph neural network. These findings offer a path to deploy capable emotion recognition in real-world, speech-centric scenarios, while highlighting hyper-parameter trade-offs and future directions for real-time and higher-modality extensions.

Abstract

This paper presents an innovative approach to address the challenges of translating multi-modal emotion recognition models to a more practical and resource-efficient uni-modal counterpart, specifically focusing on speech-only emotion recognition. Recognizing emotions from speech signals is a critical task with applications in human-computer interaction, affective computing, and mental health assessment. However, existing state-of-the-art models often rely on multi-modal inputs, incorporating information from multiple sources such as facial expressions and gestures, which may not be readily available or feasible in real-world scenarios. To tackle this issue, we propose a novel framework that leverages knowledge distillation and masked training techniques.
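As a rough illustration of the two ideas named above, the sketch below shows a generic temperature-scaled distillation loss and a simple modality-masking step. It is not the paper's implementation: the loss form (Hinton-style soft targets mixed with cross-entropy), the names `distillation_loss` and `mask_modalities`, and the parameters `temperature`, `alpha`, and `keep_prob` are all assumptions made for illustration, and the masking probabilities actually used in the paper are not reproduced here.

```python
import math
import random


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of a list of logits at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum for x in scaled]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Generic knowledge-distillation loss (assumed form, not the paper's).

    soft term: KL(teacher || student) on temperature-softened distributions,
               scaled by temperature**2 as is conventional.
    hard term: cross-entropy of the student against the ground-truth label.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_teacher, log_p_student))
    hard = -log_softmax(student_logits)[label]
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * hard


def mask_modalities(features, keep_prob, rng, always_keep=("audio",)):
    """Zero out each non-audio modality with probability 1 - keep_prob.

    `features` maps a modality name to a feature vector (list of floats).
    Keeping audio unconditionally is an assumption for a speech-only target.
    """
    masked = {}
    for name, vec in features.items():
        keep = name in always_keep or rng.random() < keep_prob
        masked[name] = list(vec) if keep else [0.0] * len(vec)
    return masked


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical per-utterance features and logits, for illustration only.
    utterance = {"audio": [0.2, 0.7], "video": [0.5, 0.1], "text": [0.9, 0.3]}
    print(mask_modalities(utterance, keep_prob=0.5, rng=rng))
    print(distillation_loss([1.0, 0.2, -0.5], [2.0, 0.1, -1.0], label=0))
```

In this sketch the speech-only student would see only the audio features while matching the softened outputs of the multimodal teacher, and masking would be applied to the training inputs so the model learns to cope with missing modalities.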

Paper Structure (15 sections, 2 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: COGMEN Model Architecture
  • Figure 2: An illustration of the 4 input masking scenarios and their probabilities used during training.
  • Figure 3: Comparison of the confusion matrix for the teacher and student models.
  • Figure 4: Comparison of the confusion matrix for the COGMEN-A and COGMEN-Mask models.
  • Figure 5: Confusion matrices with HuBERT audio features on Base COGMEN
  • ...and 1 more figure