
SEGAA: A Unified Approach to Predicting Age, Gender, and Emotion in Speech

Aron R, Indra Sigicharla, Chirag Periwal, Mohanaprasad K, Nithya Darisini P S, Sourabh Tiwari, Shivani Arora

TL;DR

This work tackles the challenge of predicting age, gender, and emotion from speech using a unified multi-output SEGAA architecture. By merging CREMA-D and EMO-DB to obtain triple-labeled data and applying extensive feature extraction with data augmentation, the authors compare univariate, multi-output, and sequential approaches. Results show that multi-output SEGAA achieves performance close to independent models while offering runtime efficiency, and sequential cascades tend to propagate errors. The study provides evidence that leveraging interdependencies among vocal attributes can yield robust, efficient predictions suitable for real-time applications in diverse domains.

Abstract

Interpreting the human voice matters across many applications. This study predicts age, gender, and emotion from vocal cues. Advances in voice analysis technology span domains, from improving customer interactions to enhancing healthcare and retail experiences. Discerning emotions can support mental health care, while age and gender detection are useful in many contexts. The paper explores deep learning models for these predictions by comparing single-output, multi-output, and sequential models. Sourcing suitable data was a challenge, so the CREMA-D and EMO-DB datasets were merged. Prior work showed promise in predicting each variable individually, but little research has considered all three simultaneously. This paper identifies flaws in the individual-model approach and advocates a novel multi-output learning architecture, the Speech-based Emotion Gender and Age Analysis (SEGAA) model. The experiments suggest that multi-output models perform comparably to individual models, efficiently capturing the intricate relationships between variables and speech inputs while achieving improved runtime.
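To make the multi-output idea concrete, the sketch below shows one shared hidden layer feeding three classification heads, trained against a summed cross-entropy loss. It is a minimal illustration only, not the SEGAA architecture: the feature size, hidden width, class counts, equal loss weighting, and all function names are assumptions introduced here, and the inputs are random stand-ins for extracted speech features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration; none are taken from the paper.
N_FEATURES = 40                                  # length of one acoustic feature vector
HIDDEN = 32                                      # width of the shared hidden layer
HEADS = {"emotion": 6, "gender": 2, "age": 4}    # assumed class counts per task


def init_params():
    """Create weights for one shared trunk plus one linear head per task."""
    params = {
        "trunk_w": rng.normal(0.0, 0.1, (N_FEATURES, HIDDEN)),
        "trunk_b": np.zeros(HIDDEN),
    }
    for name, n_classes in HEADS.items():
        params[f"{name}_w"] = rng.normal(0.0, 0.1, (HIDDEN, n_classes))
        params[f"{name}_b"] = np.zeros(n_classes)
    return params


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def forward(params, x):
    """One pass through the shared trunk, then every head reuses the same features."""
    h = np.maximum(0.0, x @ params["trunk_w"] + params["trunk_b"])  # shared ReLU layer
    return {
        name: softmax(h @ params[f"{name}_w"] + params[f"{name}_b"])
        for name in HEADS
    }


def joint_loss(probs, labels):
    """Sum of per-task mean cross-entropies (equal weighting is an assumption)."""
    total = 0.0
    for name in HEADS:
        rows = np.arange(len(labels[name]))
        total += -np.log(probs[name][rows, labels[name]] + 1e-12).mean()
    return total


params = init_params()
x = rng.normal(size=(8, N_FEATURES))             # batch of 8 synthetic feature vectors
labels = {name: rng.integers(0, k, size=8) for name, k in HEADS.items()}

probs = forward(params, x)
loss = joint_loss(probs, labels)
print({name: p.shape for name, p in probs.items()}, float(loss))
```

Because the trunk is evaluated once per input and shared by all three heads, a single forward pass yields all three predictions. That is one plausible reading of the runtime advantage reported over running three independent models.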

Paper Structure

This paper contains 23 sections, 9 figures, 1 table.

Figures (9)

  • Figure 1: Workflow Diagram
  • Figure 2: Model architecture for Individual MLP
  • Figure 3: Model architecture for Individual SEGAA
  • Figure 4: Model architecture for Multi-output MLP
  • Figure 5: Model architecture for Multi-output SEGAA (Speech-based Emotion, Gender, Age Analysis)
  • ...and 4 more figures