Enhancing Age-Related Robustness in Children Speaker Verification

Vishwas M. Shetty, Jiusi Zheng, Steven M. Lulich, Abeer Alwan

TL;DR

The paper tackles age-related variability in children's speaker verification with two robustness strategies: a Feature Transform Adapter (FTA) that integrates local patterns into higher-level global representations, and Synthetic Audio Augmentation (SAA) using HiFi-GAN to diversify training data. It also presents a longitudinal child speech test set (IU) for evaluating inter-year robustness. Key contributions are the FTA architecture with residual integration, the HiFi-GAN-based SAA approach, and the new longitudinal dataset. With both methods combined, the average inter-year equal error rate (EER) drops by 19.4%, 13.0%, and 6.1% for one-year, two-year, and three-year gaps, respectively, compared to the baseline, albeit with some intra-year trade-offs. These gains are relevant to long-term child voice verification in educational and interactive settings, where enrollment and verification may be separated by years.

Abstract

One of the main challenges in children's speaker verification (C-SV) is the significant change in children's voices as they grow. In this paper, we propose two approaches to improve age-related robustness in C-SV. We first introduce a Feature Transform Adapter (FTA) module that integrates local patterns into higher-level global representations, reducing overfitting to specific local features and improving the inter-year SV performance of the system. We then employ Synthetic Audio Augmentation (SAA) to increase data diversity and size, thereby improving robustness against age-related changes. Since the lack of longitudinal speech datasets makes it difficult to measure age-related robustness of C-SV systems, we introduce a longitudinal dataset to assess inter-year verification robustness of C-SV systems. By integrating both of our proposed methods, the average equal error rate was reduced by 19.4%, 13.0%, and 6.1% in the one-year, two-year, and three-year gap inter-year evaluation sets, respectively, compared to the baseline.
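
The results above are reported as equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. As a rough illustration only, the sketch below approximates EER from two lists of verification scores with a simple threshold sweep. The function names `equal_error_rate` and `relative_reduction` are hypothetical and not taken from the paper, and the helper assumes the reported percentages are relative reductions, which the abstract does not state explicitly.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Hypothetical helper for illustration; higher scores are assumed to mean
    "more likely the same speaker".
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))

    # False rejection rate: fraction of same-speaker trials scored below t.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance rate: fraction of different-speaker trials at or above t.
    far = np.array([(nontarget >= t).mean() for t in thresholds])

    # Take the threshold where the two error rates are closest and
    # report their midpoint as the EER estimate.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction in percent (assumed interpretation)."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


if __name__ == "__main__":
    # Toy scores, not data from the paper.
    same_speaker = [0.6, 0.4, 0.9, 0.8]
    different_speaker = [0.1, 0.5, 0.3, 0.2]
    print(equal_error_rate(same_speaker, different_speaker))
    print(relative_reduction(10.0, 8.0))
```

In an inter-year setting, the same computation would be run separately on trial lists whose enrollment and test utterances are one, two, or three years apart, so that each gap gets its own EER.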

Paper Structure

This paper contains 15 sections, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Overview of the proposed Feature Transform Adapter