Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits

Tiantian Feng, Jihwan Lee, Anfeng Xu, Yoonjeong Lee, Thanathai Lertpetchpun, Xuan Shi, Helin Wang, Thomas Thebaud, Laureano Moro-Velazquez, Dani Byrd, Najim Dehak, Shrikanth Narayanan

TL;DR

Vox-Profile presents a holistic benchmark for characterizing rich speaker and speech traits with speech foundation models. It integrates static traits such as age, sex, and accent with dynamic traits such as emotion, speech flow, and expressiveness within a linguistically informed taxonomy. The framework is evaluated across more than 15 datasets using models such as Whisper, HuBERT, WavLM, and ECAPA-TDNN, and it is applied to downstream tasks including ASR performance analysis, evaluation of speech generation systems, and generation of synthetic speaking style prompts. Empirical results show that larger models yield stronger trait predictions, with Whisper variants often excelling on static traits, while dynamic traits remain more challenging. Ensembles provide additional gains, and Vox-Profile can produce synthetic metadata whose ASR performance trends resemble those obtained with ground-truth labels. The platform also includes automated tools for evaluating generation fidelity and human-centric assessments of synthetic prompts, which support the use of Vox-Profile for model analysis, qualitative evaluation, and prompt synthesis. The authors acknowledge limitations in accent precision and multilingual extension. Overall, Vox-Profile offers a scalable, linguistically grounded, multi-trait benchmark that enables nuanced analysis and practical applications in speech technology, with future work extending multilingual coverage and covering broader model architectures.

Abstract

We introduce Vox-Profile, a comprehensive benchmark to characterize rich speaker and speech traits using speech foundation models. Unlike existing works that focus on a single dimension of speaker traits, Vox-Profile provides holistic and multi-dimensional profiles that reflect both static speaker traits (e.g., age, sex, accent) and dynamic speech properties (e.g., emotion, speech flow). This benchmark is grounded in speech science and linguistics, developed with domain experts to accurately index speaker and speech characteristics. We report benchmark experiments using over 15 publicly available speech datasets and several widely used speech foundation models that target various static and dynamic speaker and speech properties. In addition to benchmark experiments, we showcase several downstream applications supported by Vox-Profile. First, we show that Vox-Profile can augment existing speech recognition datasets to analyze ASR performance variability. Vox-Profile is also used as a tool to evaluate the performance of speech generation systems. Finally, we assess the quality of our automated profiles through comparison with human evaluation and show convergent validity. Vox-Profile is publicly available at: https://github.com/tiantiaf0627/vox-profile-release.
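To make the ASR performance analysis use case concrete, the following is a minimal sketch of stratifying word error rate (WER) by a trait label. It is not the Vox-Profile API: the field names (`ref`, `hyp`, `accent_true`, `accent_pred`), the group labels `A` and `B`, and the toy transcripts are all hypothetical, and the predicted labels stand in for whatever a trait classifier would output.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_group(samples, key):
    """Corpus-level WER per group: total word edits / total reference words."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for s in samples:
        ref = s["ref"].split()
        hyp = s["hyp"].split()
        errors[s[key]] += edit_distance(ref, hyp)
        words[s[key]] += len(ref)
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Hypothetical toy data: reference transcript, ASR hypothesis, and a
# ground-truth vs. predicted accent label for each utterance.
samples = [
    {"ref": "turn the lights on", "hyp": "turn the light on",
     "accent_true": "A", "accent_pred": "A"},
    {"ref": "play some music", "hyp": "play some music",
     "accent_true": "A", "accent_pred": "B"},
    {"ref": "what time is it", "hyp": "what time is",
     "accent_true": "B", "accent_pred": "B"},
]

wer_true = wer_by_group(samples, "accent_true")
wer_pred = wer_by_group(samples, "accent_pred")
for group in sorted(wer_true):
    print(group, round(wer_true[group], 3), round(wer_pred.get(group, float("nan")), 3))
```

Comparing the two dictionaries group by group is one simple way to check whether predicted trait labels reproduce the WER trends seen with ground-truth labels.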

Paper Structure

This paper contains 50 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Overview of the proposed Vox-Profile Benchmark and its applications. We highlight three primary use cases: (1) speech model (such as ASR) performance analysis, (2) automated speech generation evaluation, and (3) automated speaking style tagging.
  • Figure 2: ASR performance trends, grouped by ground-truth labels and by labels predicted by Vox-Profile. We measure WER, stratified by accent and emotion labels. We observe similar performance trends between the predicted and ground-truth trait labels.
  • Figure 3: Comparing human evaluation results between synthetic speaking style prompts created by Vox-Profile and human-annotated prompts from ParaSpeechCaps. Human raters provided their preferences across overall prompt quality, sex, age, accent, voice quality, fluency, and emotion.
  • Figure 4: Accent label distribution
  • Figure 5: Prompt example for generating speaking style.
  • ...and 2 more figures