VoxAging: Continuously Tracking Speaker Aging with a Large-Scale Longitudinal Dataset in English and Mandarin
Zhiqi Ai, Meixuan Bao, Zhiyong Chen, Zhi Yang, Xinnuo Li, Shugong Xu
TL;DR
VoxAging is a large-scale longitudinal English–Mandarin dataset of weekly recordings from 293 speakers, spanning up to 17 years, which enables dense analysis of how aging affects speaker verification. The authors compare it with earlier aging datasets, describe a multi-stage data collection and cleaning pipeline with dynamic templates, and evaluate ArcFace and seven speaker recognition (SR) models, showing that aging degrades verification performance over time. Speaker similarity decreases and equal error rates (EERs) rise as speakers age; Mandarin data is generally more affected than English, and the effect differs across age groups and genders. Stronger SR models resist aging better, and the dataset offers a resource for studying aging dynamics and building more robust verification systems.
Abstract
The performance of speaker verification systems is adversely affected by speaker aging. However, due to challenges in data collection, particularly the lack of sustained, large-scale longitudinal data for individuals, research on speaker aging remains difficult. In this paper, we present VoxAging, a large-scale longitudinal dataset collected from 293 speakers (226 English speakers and 67 Mandarin speakers) over several years, with the longest time span reaching 17 years (approximately 900 weeks). For each speaker, the data were recorded at weekly intervals. We studied the phenomenon of speaker aging and its effects on advanced speaker verification systems, analyzed the aging processes of individual speakers, and explored how factors such as age group and gender influence speaker aging.
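To make the evaluation idea concrete, the following is a minimal sketch of how verification error could be tracked against the time gap between enrollment and test recordings. It is not the authors' evaluation code. The function names, the trial tuple layout `(score, is_same_speaker, gap_weeks)`, and the 52-week bin width are all assumptions chosen for illustration, and the EER is approximated by a simple threshold sweep.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two speaker embeddings (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def equal_error_rate(scores, labels):
    """Approximate EER by sweeping every observed score as a threshold.

    scores: similarity scores, higher means "more likely the same speaker".
    labels: True for same-speaker (target) trials, False for impostor trials.
    Assumes at least one target and one impostor trial are present.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer = np.inf, 1.0
    for threshold in np.unique(scores):
        accept = scores >= threshold
        far = np.mean(accept[~labels])   # impostors wrongly accepted
        frr = np.mean(~accept[labels])   # targets wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return float(eer)


def eer_by_gap(trials, bin_weeks=52):
    """Group trials by enrollment-to-test gap and compute one EER per bin.

    trials: iterable of (score, is_same_speaker, gap_weeks) tuples
            (an assumed layout, not the dataset's actual format).
    Returns {bin_index: eer}, where bin_index = gap_weeks // bin_weeks.
    """
    bins = {}
    for score, same, gap in trials:
        bins.setdefault(int(gap // bin_weeks), []).append((score, same))
    return {b: equal_error_rate(*zip(*pairs)) for b, pairs in sorted(bins.items())}
```

With real data, the scores would come from `cosine_similarity` applied to embeddings produced by a speaker verification model, and a rising EER across successive bins would indicate the kind of aging-related degradation the abstract describes.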
