Speech Emotion Recognition with Distilled Prosodic and Linguistic Affect Representations

Debaditya Shome, Ali Etemad

Abstract

We propose EmoDistill, a novel speech emotion recognition (SER) framework that leverages cross-modal knowledge distillation during training to learn strong linguistic and prosodic representations of emotion from speech. During inference, our method uses only a stream of speech signals to perform unimodal SER, thus reducing computational overhead and avoiding run-time transcription and prosodic feature extraction errors. During training, our method distills information at both the embedding and logit levels from a pair of pre-trained Prosodic and Linguistic teachers that are fine-tuned for SER. Experiments on the IEMOCAP benchmark demonstrate that our method outperforms other unimodal and multimodal techniques by a considerable margin, achieving state-of-the-art performance of 77.49% unweighted accuracy and 78.91% weighted accuracy. Detailed ablation studies demonstrate the impact of each component of our method.

Paper Structure

This paper contains 8 sections, 5 equations, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: EmoDistill Framework. Our student network is trained using a distillation of logit-level and embedding-level knowledge from frozen linguistic and prosodic teacher networks, along with standard cross-entropy loss. During inference, we only use the student network in a unimodal setup, avoiding computational overhead as well as transcription and prosodic feature extraction errors.
  • Figure 2: Left: We remove $f^{L}_{T}$ and vary $\tau_{P}$. Right: We remove $f^{P}_{T}$ and vary $\tau_{L}$.
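
The training objective described in the abstract and in the Figure 1 caption (cross-entropy plus logit-level and embedding-level distillation from two frozen teachers) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the temperature-scaled KL term, the cosine-distance embedding term, the loss weights `alpha` and `beta`, the default temperatures, and all function and argument names are assumptions introduced here for illustration. The paper's own equations define the actual losses.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(student_logits, label):
    """Standard cross-entropy against the ground-truth class index."""
    return -math.log(softmax(student_logits)[label])

def logit_kd(student_logits, teacher_logits, tau):
    """Assumed logit-level term: KL(teacher || student) on softened
    distributions, scaled by tau^2 as in classic knowledge distillation."""
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    return tau * tau * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def embedding_kd(student_emb, teacher_emb):
    """Assumed embedding-level term: cosine distance (1 - cosine similarity)."""
    dot = sum(s * t for s, t in zip(student_emb, teacher_emb))
    norm_s = math.sqrt(sum(s * s for s in student_emb))
    norm_t = math.sqrt(sum(t * t for t in teacher_emb))
    return 1.0 - dot / (norm_s * norm_t)

def distillation_loss(student, teacher_L, teacher_P, label,
                      tau_L=2.0, tau_P=2.0, alpha=1.0, beta=1.0):
    """Total loss = CE + alpha * (logit KD from both teachers)
                       + beta * (embedding KD from both teachers).

    `student` holds "logits" plus one embedding per teacher space
    ("emb_L", "emb_P"); each teacher holds "logits" and "emb".
    The teachers are frozen, so only their outputs are needed here.
    """
    ce = cross_entropy(student["logits"], label)
    kd_logits = (logit_kd(student["logits"], teacher_L["logits"], tau_L)
                 + logit_kd(student["logits"], teacher_P["logits"], tau_P))
    kd_emb = (embedding_kd(student["emb_L"], teacher_L["emb"])
              + embedding_kd(student["emb_P"], teacher_P["emb"]))
    return ce + alpha * kd_logits + beta * kd_emb

# Toy example with 4 emotion classes and 3-dimensional embeddings.
student = {"logits": [2.0, 0.5, -1.0, 0.1],
           "emb_L": [0.2, 0.9, -0.3], "emb_P": [0.5, -0.1, 0.7]}
teacher_L = {"logits": [3.0, 0.2, -2.0, 0.0], "emb": [0.1, 1.0, -0.2]}
teacher_P = {"logits": [1.5, 1.0, -0.5, 0.3], "emb": [0.6, 0.0, 0.6]}
loss = distillation_loss(student, teacher_L, teacher_P, label=0)
```

At inference time none of the teacher terms are needed: only the student's logits are computed from the speech input, which matches the unimodal setup the caption describes.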