VoxAging: Continuously Tracking Speaker Aging with a Large-Scale Longitudinal Dataset in English and Mandarin

Zhiqi Ai, Meixuan Bao, Zhiyong Chen, Zhi Yang, Xinnuo Li, Shugong Xu

TL;DR

VoxAging introduces a large-scale, longitudinal English–Mandarin dataset of weekly recordings from 293 speakers, spanning up to 17 years, which enables dense analysis of aging effects on speaker verification. The authors compare VoxAging with earlier aging datasets, describe a multi-stage data collection and cleaning pipeline with dynamic templates, and evaluate ArcFace and seven speaker recognition (SR) models, showing that aging degrades verification performance over time. Results show speaker similarity decreasing and equal error rates (EERs) rising as speakers age, with Mandarin data generally more affected than English, and reveal differences in aging effects across age groups and genders. The work also indicates that stronger SR models resist aging better, and the dataset offers a resource for studying aging dynamics and building more robust verification systems.

Abstract

The performance of speaker verification systems is adversely affected by speaker aging. However, due to challenges in data collection, particularly the lack of sustained and large-scale longitudinal data for individuals, research on speaker aging remains difficult. In this paper, we present VoxAging, a large-scale longitudinal dataset collected from 293 speakers (226 English speakers and 67 Mandarin speakers) over several years, with the longest time span reaching 17 years (approximately 900 weeks). For each speaker, the data were recorded at weekly intervals. We studied the phenomenon of speaker aging and its effects on advanced speaker verification systems, analyzed individual speaker aging processes, and explored the impact of factors such as age group and gender on speaker aging research.

Paper Structure

This paper contains 13 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Previous short-term datasets have continuous intervals but limited time spans, while long-term datasets have long time spans but discrete intervals; both are sparsely sampled. VoxAging offers dense sampling, continuous weekly intervals, long time spans, and multi-modal data.
  • Figure 2: VoxAging dataset distribution: (a) timespan distribution, (b) duration distribution. In VoxAging, there are 293 speakers: 226 English speakers (112 female and 114 male) and 67 Mandarin speakers (23 female and 44 male).
  • Figure 3: Illustration of the collection pipeline.
  • Figure 4: Speaker similarity scores over time in VoxAging. Dashed black line indicates the average aging trend.
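
The kind of similarity-over-time curve described for Figure 4 can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes each recording has already been mapped to a fixed-length speaker embedding, scores later embeddings against an enrollment embedding with cosine similarity, and fits a straight line as a stand-in for an "average aging trend". The function names, the 64-dimensional synthetic embeddings, and the linear drift model are all hypothetical and chosen only for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_over_time(enroll, later):
    """Score each later embedding against the enrollment embedding."""
    return np.array([cosine_similarity(enroll, e) for e in later])

def linear_trend(weeks, scores):
    """Least-squares line through (week, score); returns (slope, intercept)."""
    slope, intercept = np.polyfit(
        np.asarray(weeks, dtype=float), np.asarray(scores, dtype=float), deg=1
    )
    return float(slope), float(intercept)

# Synthetic data (an assumption, not VoxAging): the embedding drifts
# linearly away from the enrollment vector as the weeks pass.
rng = np.random.default_rng(0)
enroll = rng.normal(size=64)   # hypothetical enrollment embedding
drift = rng.normal(size=64)    # hypothetical aging direction
weeks = np.arange(0, 200, 10)
later = [enroll + 0.01 * w * drift for w in weeks]

scores = similarity_over_time(enroll, later)
slope, intercept = linear_trend(weeks, scores)
print(f"similarity change per week: {slope:.5f}")
```

With real data, `later` would hold embeddings from one speaker's weekly recordings, and a negative slope would correspond to similarity to the enrollment template falling over time.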