Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits
Tiantian Feng, Jihwan Lee, Anfeng Xu, Yoonjeong Lee, Thanathai Lertpetchpun, Xuan Shi, Helin Wang, Thomas Thebaud, Laureano Moro-Velazquez, Dani Byrd, Najim Dehak, Shrikanth Narayanan
TL;DR
Vox-Profile presents a holistic benchmark to characterize rich speaker and speech traits using speech foundation models, integrating static traits such as age, sex, and accent with dynamic traits such as emotion, speech flow, and expressiveness within a linguistically informed taxonomy. The framework is evaluated across more than 15 datasets using models such as Whisper, HuBERT, WavLM, and ECAPA-TDNN, and it is applied to downstream tasks including ASR performance analysis, evaluation of speech generation systems, and generation of synthetic speaking style prompts. Empirical results show that larger models yield stronger trait predictions, with Whisper variants often excelling on static traits, while dynamic traits present greater challenges. Ensembles provide additional gains, and Vox-Profile can produce reliable synthetic metadata that tracks ASR performance trends similarly to ground-truth labels. The platform also includes automated tools for evaluating generation fidelity and human-centric assessments of synthetic prompts, supporting the utility of Vox-Profile for model analysis, qualitative evaluation, and prompt synthesis, while acknowledging limitations in accent precision and multilingual extension. Overall, Vox-Profile offers a scalable, linguistically grounded, multi-trait benchmark that enables nuanced analysis and practical applications in speech technology, with future work extending multilingual coverage and covering broader model architectures.
Abstract
We introduce Vox-Profile, a comprehensive benchmark to characterize rich speaker and speech traits using speech foundation models. Unlike existing works that focus on a single dimension of speaker traits, Vox-Profile provides holistic and multi-dimensional profiles that reflect both static speaker traits (e.g., age, sex, accent) and dynamic speech properties (e.g., emotion, speech flow). This benchmark is grounded in speech science and linguistics, developed with domain experts to accurately index speaker and speech characteristics. We report benchmark experiments using over 15 publicly available speech datasets and several widely used speech foundation models that target various static and dynamic speaker and speech properties. In addition to benchmark experiments, we showcase several downstream applications supported by Vox-Profile. First, we show that Vox-Profile can augment existing speech recognition datasets to analyze ASR performance variability. Vox-Profile is also used as a tool to evaluate the performance of speech generation systems. Finally, we assess the quality of our automated profiles through comparison with human evaluation and show convergent validity. Vox-Profile is publicly available at: https://github.com/tiantiaf0627/vox-profile-release.
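As an illustration of the ASR variability analysis mentioned above, the following is a minimal sketch of how word error rate (WER) could be broken down by a predicted trait once each utterance carries a profile label. It does not use the Vox-Profile code or API: the record fields (`reference`, `hypothesis`, `profile`), the group names, and the function names are hypothetical and chosen only for this example, and the transcripts are toy data rather than results from the paper.

```python
from collections import defaultdict


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_trait(utterances, trait):
    """Pool word errors and reference lengths per trait value, then divide."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for utt in utterances:
        ref = utt["reference"].split()
        hyp = utt["hypothesis"].split()
        group = utt["profile"][trait]  # hypothetical predicted label
        errors[group] += word_edit_distance(ref, hyp)
        words[group] += len(ref)
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Toy records; the "profile" entries stand in for automatically predicted traits.
utterances = [
    {"reference": "turn the lights off", "hypothesis": "turn the lights off",
     "profile": {"accent": "group_a"}},
    {"reference": "play some music", "hypothesis": "play sam music",
     "profile": {"accent": "group_b"}},
    {"reference": "what time is it", "hypothesis": "what time is",
     "profile": {"accent": "group_b"}},
]

print(wer_by_trait(utterances, "accent"))
```

Pooling errors and reference words per group before dividing weights each group's WER by utterance length, which is the usual corpus-level convention; averaging per-utterance WER would be an alternative design choice.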
