Demographic Attributes Prediction from Speech Using WavLM Embeddings
Yuchen Yang, Thomas Thebaud, Najim Dehak
TL;DR
The paper shows that frozen WavLM embeddings, paired with simple prediction heads (MLP, LSTM, or ResNet32), provide robust, generalizable representations for predicting speaker demographics from speech. Evaluated on five heterogeneous corpora, the approach improves age MAE and gender accuracy across datasets relative to traditional baselines such as i-vectors and x-vectors. Key contributions are cross-dataset benchmarking, strong results for age and gender prediction, and insights into the limitations posed by heterogeneous labels, bias, and domain shift. The findings support the feasibility of privacy-conscious, adaptive speech systems with demographic-aware behavior, and they motivate standardized datasets and label ontologies to improve robustness and fair use in real-world applications.
Abstract
This paper introduces a general classifier based on WavLM features to infer demographic characteristics, such as age, gender, native language, education, and country, from speech. Demographic feature prediction plays a crucial role in applications like language learning, accessibility, and digital forensics, enabling more personalized and inclusive technologies. Leveraging pretrained models for embedding extraction, the proposed framework identifies key acoustic and linguistic features associated with demographic attributes, achieving a Mean Absolute Error (MAE) of 4.94 for age prediction and over 99.81% accuracy for gender classification across various datasets. Our system improves upon existing models by up to 30% relative in MAE and up to 10% relative in accuracy and F1 scores across tasks, drawing on a diverse range of datasets and large pretrained models to ensure robustness and generalizability. This study offers new insights into speaker diversity and provides a strong foundation for future research in speech-based demographic profiling.
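To make the reported metrics concrete, the following minimal sketch shows how MAE and a relative improvement could be computed. The function names and all numeric values are illustrative assumptions, not the authors' evaluation code or results; the relative-improvement formula is the conventional one and is assumed here, since the abstract does not define it.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values (e.g. ages in years)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_improvement(baseline, system, lower_is_better):
    """Fractional change of `system` over `baseline`, positive when `system` is better.

    Assumed convention: (baseline - system) / baseline for error metrics such as MAE,
    and (system - baseline) / baseline for scores such as accuracy or F1.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    if lower_is_better:
        return (baseline - system) / baseline
    return (system - baseline) / baseline


# Toy ages (hypothetical, not taken from the paper).
true_ages = [23, 41, 35, 60]
pred_ages = [25, 38, 35, 66]
toy_mae = mean_absolute_error(true_ages, pred_ages)  # (2 + 3 + 0 + 6) / 4

# Hypothetical baseline and system scores, chosen only to illustrate the arithmetic.
mae_gain = relative_improvement(baseline=10.0, system=7.0, lower_is_better=True)
acc_gain = relative_improvement(baseline=80.0, system=88.0, lower_is_better=False)

print(toy_mae, mae_gain, acc_gain)
```

Under this convention, a 30% relative MAE improvement means the system's error is 70% of the baseline's, which is different from a reduction of 30 absolute units.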
