PERSONA: An Application for Emotion Recognition, Gender Recognition and Age Estimation
Devyani Koshal, Orchid Chetia Phukan, Sarthak Jain, Arun Balaji Buduru, Rajesh Sharma
TL;DR
The paper tackles the challenge of jointly predicting emotion, gender, and age from speech while minimizing resource expenditure by using a single model. It compares representations from the self-supervised WavLM pre-trained model (PTM) with those from a speaker-recognition-focused x-vector model, finding that x-vector features yield superior multi-task performance when combined with a CNN-FCN backbone. The proposed PERSONA system demonstrates end-to-end practicality with a React/Flask deployment and 16 kHz audio preprocessing, achieving real-time inference suitable for interactive applications. Overall, the work highlights the effectiveness of leveraging speaker-recognition representations for consolidated paralinguistic inference and suggests a path toward more scalable, multi-task speech analytics.
Abstract
Emotion Recognition (ER), Gender Recognition (GR), and Age Estimation (AE) are paralinguistic tasks that rely not on the spoken content but primarily on speech characteristics such as pitch and tone. While previous research has made significant strides in developing models for each task individually, comparatively little emphasis has been placed on learning these tasks concurrently, despite their inherent interconnectedness. In this demonstration, we therefore present PERSONA, an application that performs ER, GR, and AE with a single model in the backend. Notably, through a comparative study, we show that representations from a speaker recognition pre-trained model (PTM) are better suited to this multi-task learning format than those from a state-of-the-art (SOTA) self-supervised learning (SSL) PTM. Our methodology obviates the need to deploy a separate model for each task and can potentially conserve resources and time during the training and deployment phases.
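The single-model idea described above can be illustrated with a minimal sketch: one shared layer feeds three task-specific heads, so a single forward pass over a fixed-size utterance embedding yields all three predictions. This is not the authors' CNN-FCN architecture. It is a dense-only stand-in in pure Python, and the names `MultiTaskHead`, `init_layer`, and `predict`, the 512-dimensional embedding, the four emotion classes, the treatment of age as a regression output, and the random untrained weights are all assumptions made for illustration.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (untrained, illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class MultiTaskHead:
    """Hypothetical shared layer plus three heads on one embedding."""

    def __init__(self, emb_dim=512, hidden=32, n_emotions=4, seed=0):
        rng = random.Random(seed)
        self.shared = init_layer(rng, hidden, emb_dim)       # shared by all tasks
        self.emotion = init_layer(rng, n_emotions, hidden)   # ER: classification
        self.gender = init_layer(rng, 2, hidden)             # GR: classification
        self.age = init_layer(rng, 1, hidden)                # AE: assumed regression

    def predict(self, embedding):
        # The shared representation is computed once and reused by every head.
        h = relu(linear(embedding, *self.shared))
        return {
            "emotion_probs": softmax(linear(h, *self.emotion)),
            "gender_probs": softmax(linear(h, *self.gender)),
            "age": linear(h, *self.age)[0],
        }


# Stand-in for an utterance embedding; a real system would obtain it from a PTM.
rng = random.Random(1)
embedding = [rng.gauss(0.0, 1.0) for _ in range(512)]

model = MultiTaskHead()
out = model.predict(embedding)
print(sorted(out))
```

Because the shared layer is evaluated only once per utterance, adding a task costs one extra small head rather than a full additional model, which is the resource argument the abstract makes.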
