Voxceleb-ESP: preliminary experiments detecting Spanish celebrities from their voices
Beltrán Labrador, Manuel Otero-Gonzalez, Alicia Lozano-Diez, Daniel Ramos, Doroteo T. Toledano, Joaquin Gonzalez-Rodriguez
TL;DR
The paper introduces VoxCeleb-ESP, a Spanish in-the-wild speaker recognition dataset derived from YouTube that extends the VoxCeleb paradigm to Spanish, with 160 celebrities and about 7 hours of speech across varied acoustic conditions. It provides two speaker identification trial lists (A: same-video targets; B: cross-video targets) and establishes a baseline on the new data through a cross-lingual evaluation of ResNet models pretrained on VoxCeleb2. The results show that English-trained models generalize to Spanish with reasonable EERs, although cross-video trials are notably harder and language-specific adaptation (e.g., PLDA adaptation or in-language training data) could improve performance. VoxCeleb-ESP broadens multilingual benchmarks for speaker recognition and supports research into cross-language transfer in realistic scenarios.
Abstract
This paper presents VoxCeleb-ESP, a collection of pointers and timestamps to YouTube videos facilitating the creation of a novel speaker recognition dataset. VoxCeleb-ESP captures real-world scenarios, incorporating diverse speaking styles, noises, and channel distortions. It includes 160 Spanish celebrities spanning various categories, ensuring a representative distribution across age groups and geographic regions in Spain. We provide two speaker trial lists for speaker identification tasks, one with same-video target trials and the other with different-video target trials, accompanied by a cross-lingual evaluation of ResNet pretrained models. Preliminary speaker identification results suggest that the complexity of the detection task in VoxCeleb-ESP is equivalent to that of the original and much larger VoxCeleb in English. VoxCeleb-ESP contributes to the expansion of speaker recognition benchmarks with a comprehensive and diverse dataset for the Spanish language.
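Trial lists such as the two described here are commonly summarized with the equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of that metric, not the authors' evaluation code. The function name `compute_eer`, the convention that a higher score means "more likely the same speaker", and the label encoding (1 for target, 0 for non-target) are all assumptions made for illustration.

```python
def compute_eer(scores, labels):
    """Approximate the equal error rate of a list of scored trials.

    scores: similarity scores, higher = more likely same speaker (assumed).
    labels: 1 for target trials, 0 for non-target trials (assumed encoding).
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    # Visit trials from highest to lowest score; each step lowers the
    # decision threshold so that one more trial is accepted.
    # Note: tied scores are not grouped, which is fine for a sketch.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    accepted_targets = 0
    accepted_nontargets = 0
    # Start with a threshold above every score: FAR = 0, FRR = 1.
    best_gap, eer = 1.0, 0.5
    for i in order:
        if labels[i] == 1:
            accepted_targets += 1
        else:
            accepted_nontargets += 1
        far = accepted_nontargets / n_nontarget    # false acceptance rate
        frr = 1.0 - accepted_targets / n_target    # false rejection rate
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where FAR and FRR are closest.
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Toy example with made-up scores (not data from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_eer = compute_eer(toy_scores, toy_labels)
print(f"EER = {toy_eer:.3f}")
```

In practice the scores would come from comparing speaker embeddings for each enrollment/test pair in a trial list, and the EER would be computed separately for each list so that same-video and different-video conditions can be compared.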
