Bridging Modalities: Knowledge Distillation and Masked Training for Translating Multi-Modal Emotion Recognition to Uni-Modal, Speech-Only Emotion Recognition

Muhammad Muaz; Nathan Paull; Jahnavi Malagavalli

TL;DR

The paper addresses the challenge of translating multimodal emotion recognition models into practical, resource-efficient speech-only systems. It proposes two main approaches: knowledge distillation from a multimodal teacher (COGMEN) to a speech-only student, and masked training that grounds the model using partial modality information. Empirical results show that knowledge distillation can yield speech-only models whose performance is comparable to that of their multimodal teachers, while masked training can outperform both audio-only and audio–video baselines and even enable effective learning without a graph neural network. These findings offer a path to deploying capable emotion recognition in real-world, speech-centric scenarios, while highlighting hyper-parameter trade-offs and future directions for real-time operation and extensions to additional modalities.

Abstract

This paper presents an innovative approach to translating multi-modal emotion recognition models into more practical, resource-efficient uni-modal counterparts, focusing specifically on speech-only emotion recognition. Recognizing emotions from speech signals is a critical task with applications in human-computer interaction, affective computing, and mental health assessment. However, existing state-of-the-art models often rely on multi-modal inputs, incorporating information from sources such as facial expressions and gestures, which may not be readily available or feasible to capture in real-world scenarios. To tackle this issue, we propose a novel framework that leverages knowledge distillation and masked training techniques.
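
As a rough illustration of the knowledge distillation idea mentioned above, the following is a minimal sketch of a generic temperature-scaled distillation loss (soft teacher targets combined with a hard-label cross-entropy term). It is not taken from the paper: the function names, the `temperature` and `alpha` parameters, and their default values are assumptions chosen for illustration, and the paper's actual loss may differ.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum for z in scaled]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Generic distillation loss for one example (illustrative only).

    temperature and alpha are hypothetical hyper-parameters:
    alpha weights the soft (teacher) term, 1 - alpha the hard-label term.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    # Soft term: KL(teacher || student) on temperature-softened distributions.
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_teacher, log_p_student))
    # Hard term: cross-entropy of the student against the ground-truth label.
    ce = -log_softmax(student_logits)[label]
    # The T^2 factor keeps the soft-term gradient scale comparable across temperatures.
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * ce

# Example: a multimodal teacher and a speech-only student scoring 3 emotion classes.
teacher = [0.2, 2.5, -1.0]   # hypothetical teacher logits
student = [0.5, 1.0, -0.5]   # hypothetical student logits
loss = distillation_loss(student, teacher, label=1)
```

In this sketch the student is pulled toward the teacher's full output distribution rather than only the correct class, which is the usual motivation for distilling a multimodal teacher into a model that sees a single modality.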

Paper Structure

This paper contains 15 sections, 2 equations, 6 figures, 5 tables.

Introduction
Related Work
Approach
Dataset
Model Architecture
Feature Embeddings
Knowledge Distillation
Masked Training
Results
Knowledge Distillation
Masked Training
Conclusion
Future Work
Input Masked Emotion Recognition without a GNN
Feature Embeddings Results and Analysis

Figures (6)

Figure 1: COGMEN Model Architecture
Figure 2: An illustration of the 4 input masking scenarios and their probabilities used during training.
Figure 3: Comparison of the confusion matrix for the teacher and student models.
Figure 4: Comparison of the confusion matrix for the COGMEN-A and COGMEN-Mask models.
Figure 5: Confusion matrices with HuBERT audio features on Base COGMEN
...and 1 more figure
