HeightCeleb - an enrichment of VoxCeleb dataset with speaker height information

Stanisław Kacprzak, Konrad Kowalczyk

TL;DR

This work tackles the scarcity of freely available, large-scale height annotations for speaker data by introducing HeightCeleb, an enrichment of VoxCeleb that attaches height information to 1251 speakers using public sources. It demonstrates a practical pipeline where embeddings from a pre-trained ECAPA-TDNN model are regressed to height using simple methods (MLR/PLSR), with gender-specific modeling via a lightweight classifier, achieving competitive height estimation on the TIMIT dataset without height-specific fine-tuning. The HeightCeleb dataset enables training and evaluation for voice-based height estimation at scale and highlights the importance of additional error metrics beyond MAE/RMSE. Overall, the approach provides a readily usable benchmark resource and a baseline that rivals state-of-the-art methods, accelerating research in speech-based biometric trait estimation.

Abstract

Prediction of a speaker's height is of interest for voice forensics, surveillance, and automatic speaker profiling. Until now, TIMIT has been the most popular dataset for training and evaluating height estimation methods. In this paper, we introduce HeightCeleb, an extension to VoxCeleb, a dataset commonly used in speaker recognition tasks. This enrichment consists of adding height information for all 1251 speakers from VoxCeleb, extracted with an automated method from publicly available sources. Such annotated data will enable the research community to use freely available speaker embedding extractors, pre-trained on VoxCeleb, to build more efficient speaker height estimators. In this work, we describe the creation of the HeightCeleb dataset and show that it makes it possible to achieve state-of-the-art results on the TIMIT test set using simple statistical regression methods and embeddings obtained with a popular speaker model (without any additional fine-tuning).
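
The regression setup summarized above (fixed speaker embeddings, a gender decision, then a separate linear regressor per gender) can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings here are synthetic 16-dimensional vectors standing in for ECAPA-TDNN outputs, the nearest-centroid rule is an assumed stand-in for the lightweight gender classifier, and all function names (fit_mlr, fit_pipeline, predict_pipeline) are hypothetical. Only multiple linear regression (MLR) is shown; PLSR is omitted.

```python
import numpy as np

def fit_mlr(X, y):
    """Fit multiple linear regression with an intercept by least squares."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def predict_mlr(w, X):
    """Apply a fitted MLR model (last weight is the intercept)."""
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ w

def fit_pipeline(X, heights, gender):
    """Fit one centroid and one MLR regressor per gender label (0 or 1)."""
    centroids = np.stack([X[gender == g].mean(axis=0) for g in (0, 1)])
    weights = [fit_mlr(X[gender == g], heights[gender == g]) for g in (0, 1)]
    return centroids, weights

def predict_pipeline(model, X):
    """Pick a gender by nearest centroid, then use that gender's regressor."""
    centroids, weights = model
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    g_hat = dist.argmin(axis=1)
    pred = np.empty(X.shape[0])
    for g in (0, 1):
        mask = g_hat == g
        if mask.any():
            pred[mask] = predict_mlr(weights[g], X[mask])
    return pred, g_hat

# Synthetic stand-in data (assumption): real use would load speaker
# embeddings and the matching height labels in centimetres.
rng = np.random.default_rng(0)
n, dim = 400, 16
gender = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, dim)) + 3.0 * gender[:, None]
true_w = rng.normal(size=dim)
heights = 165.0 + 12.0 * gender + 0.5 * (X @ true_w) + rng.normal(scale=2.0, size=n)

train, test = np.arange(300), np.arange(300, n)
model = fit_pipeline(X[train], heights[train], gender[train])
pred, g_hat = predict_pipeline(model, X[test])

mae = np.abs(pred - heights[test]).mean()
gender_acc = (g_hat == gender[test]).mean()
print(f"MAE: {mae:.2f} cm, gender accuracy: {gender_acc:.2f}")
```

Splitting by gender before regression lets each regressor fit a narrower height range, which is the motivation for the gender-specific modeling mentioned in the TL;DR. The numbers printed here reflect only the synthetic data and say nothing about results on TIMIT.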

Paper Structure

This paper contains 11 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Speaker's height histograms across different datasets.
  • Figure 2: Example of 'answer box' powered by Google Knowledge Graph, a prime source of the collected height data.
  • Figure 3: Empirical cumulative distribution function (eCDF) for height estimation error that falls within a predefined range.
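
One plausible reading of the Figure 3 caption is a curve giving, for each error threshold, the fraction of predictions whose absolute height error does not exceed it. The sketch below computes such a curve under that assumption; the function name error_ecdf, the thresholds, and the sample values are all illustrative and not taken from the paper.

```python
import numpy as np

def error_ecdf(y_true, y_pred, thresholds):
    """Fraction of predictions with absolute error <= each threshold."""
    abs_err = np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float))
    thresholds = np.asarray(thresholds, dtype=float)
    return (abs_err[:, None] <= thresholds[None, :]).mean(axis=0)

# Toy heights in centimetres (made-up values for illustration only).
y_true = np.array([170.0, 182.0, 165.0, 158.0])
y_pred = np.array([172.0, 175.0, 166.0, 158.0])

# Absolute errors are 2, 7, 1 and 0 cm.
frac = error_ecdf(y_true, y_pred, [1.0, 5.0, 10.0])
print(frac)  # fractions within 1, 5 and 10 cm: 0.5, 0.75, 1.0
```

Unlike a single MAE or RMSE value, this kind of curve shows how often an estimator stays within a tolerance that matters for a given application.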