Dataset-Distillation Generative Model for Speech Emotion Recognition

Fabian Ritter-Gutierrez; Kuan-Po Huang; Jeremy H. M. Wong; Dianwen Ng; Hung-yi Lee; Nancy F. Chen; Eng Siong Chng

TL;DR

We address the data-hungry nature of speech emotion recognition by introducing Dataset Distillation (DD) for speech via a GAN-based distillator, applied to IEMOCAP. The method trains a small generator using WGAN-GP with feature matching, conditioning on emotion labels, and adds a softmax-probability matching loss $L_{\text{SML}}$ and a diversity penalty $L_{\text{DIV}}$ to produce discriminative synthetic samples; the DD objective combines these terms as $L_{\text{G}_{\text{DD}}}$. Experiments show that 50–100 points per class (roughly 5.6–11.2% of the original data) yield UAR competitive with real-data training, with substantial reductions in storage (~95%) and training time (~95%), and better class-balanced performance. Additionally, the approach intrinsically reduces speaker-identification information, suggesting privacy-preserving potential, and offers a path toward scaling to larger datasets.
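As a rough illustration of how a label-conditioned generator can stand in for a stored dataset, the sketch below draws a fixed number of synthetic points per class. Everything in it is assumed for illustration: `toy_generator` is a hypothetical stand-in (a fixed random linear map, not the trained generator described here), and the class count, latent size, and feature size are placeholders rather than values from the paper.

```python
import numpy as np

NUM_CLASSES = 4    # assumption: placeholder number of emotion classes
LATENT_DIM = 16    # assumption: placeholder latent (noise) size
FEATURE_DIM = 32   # assumption: placeholder synthetic-feature size


def toy_generator(z, labels, weights):
    """Hypothetical stand-in for a trained conditional generator.

    Concatenates the noise with a one-hot label and applies a fixed
    linear map followed by tanh. A real distilled generator would be
    a trained network, not a random projection.
    """
    one_hot = np.eye(NUM_CLASSES)[labels]
    return np.tanh(np.concatenate([z, one_hot], axis=1) @ weights)


def sample_balanced(points_per_class, rng, weights):
    """Draw the same number of synthetic points for every class."""
    labels = np.repeat(np.arange(NUM_CLASSES), points_per_class)
    z = rng.standard_normal((labels.size, LATENT_DIM))
    return toy_generator(z, labels, weights), labels


rng = np.random.default_rng(0)
weights = rng.standard_normal((LATENT_DIM + NUM_CLASSES, FEATURE_DIM))

# 50 points per class, the lower end of the range quoted above.
features, labels = sample_balanced(50, rng, weights)
```

Because the labels are chosen by the caller, the same routine could also follow an imbalanced class distribution by passing different counts per class; only the generator needs to be stored, not the samples.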

Abstract

Deep learning models for speech rely on large datasets, which presents computational challenges, yet performance hinges on training data size. Dataset Distillation (DD) aims to learn a smaller dataset that can be used for training without much performance degradation. DD has been investigated in computer vision but not yet in speech. This paper presents the first application of DD to speech, targeting Speech Emotion Recognition on IEMOCAP. We employ Generative Adversarial Networks (GANs) not to mimic real data but to distil the key discriminative information in IEMOCAP that is useful for downstream training. The GAN then replaces the original dataset and can sample synthetic datasets of custom size. It performs comparably to real-data training when following the original class imbalance and improves performance by 0.3% absolute UAR with balanced classes. In both cases it also reduces dataset storage and accelerates downstream training by 95%, and it reduces speaker information, which could help privacy applications.
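The abstract reports results in UAR without defining it. Under the standard definition, unweighted average recall is the mean of the per-class recalls, so every class counts equally regardless of how many examples it has. The sketch below is a minimal illustration of that definition; the toy labels are made up and are not results from the paper.

```python
import numpy as np


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, computed over the classes in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for c in np.unique(y_true):
        mask = y_true == c                       # examples whose true class is c
        recalls.append(np.mean(y_pred[mask] == c))
    return float(np.mean(recalls))


# Toy, made-up labels: class 0 is frequent, class 1 is rare.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```

In this toy case the rare class is recalled only half the time, so UAR falls below plain accuracy. This is why the metric is sensitive to whether training data is class-balanced.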

Paper Structure (10 sections, 8 equations, 2 figures, 4 tables)

Introduction
Related Work
Dataset Distillation
Dataset Distillation for Speech Emotion
Experiments
Implementation details
GAN as a dataset distillator
On the privacy aspect
Conclusions
Acknowledgments

Figures (2)

Figure 1: Usage scenario for DD on speech processing tasks. $\boldsymbol{f}_{\mathbf{T}}(\boldsymbol{x}_{test})$ represents inference on a downstream model $\boldsymbol{f}$ trained under dataset $\mathbf{T}$.
Figure 2: Schematic representation of the proposed DD. The blue dashed lines represent the standard training of a GAN
