Analysis of ABC Frontend Audio Systems for the NIST-SRE24

Sara Barahona, Anna Silnova, Ladislav Mošner, Junyi Peng, Oldřich Plchot, Johan Rohdin, Lin Zhang, Jiangyu Han, Petr Palka, Federico Landini, Lukáš Burget, Themos Stafylakis, Sandro Cumani, Dominik Boboš, Miroslav Hlavaček, Martin Kodovsky, Tomáš Pavlíček

TL;DR

The paper tackles robust speaker verification for conversational telephone speech (CTS) under the fixed and open training-data conditions of NIST SRE24. It evaluates diverse embedding frontends, including ResNet-based architectures with xi-vector pooling, ReDimNet, and a self-supervised XLS-R backbone with MHFA, and uses VoxBlink2 for open-condition training. Key findings show that ResNet frontends outperform ReDimNet in the fixed condition, while VoxBlink2 pretraining yields substantial gains in the open condition; mixing 16 kHz original data with 8 kHz downsampled data improves cross-domain generalization, and fine-tuning on longer segments yields notable EER reductions (up to $23.98\%$) and improvements in $C_{primary}$. The work suggests practical recipes for state-of-the-art CTS frontends and announces plans to release VoxBlink2 pre-trained models to support research and deployment in telephony speaker verification.
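For readers unfamiliar with the EER metric quoted above, the following is a minimal sketch of how an equal error rate can be estimated from trial scores. It is not the NIST scoring tool and not the authors' code: the function name `equal_error_rate`, the simple threshold sweep, and the toy scores are assumptions made for illustration only.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER and its threshold (illustrative helper, not from the paper)."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)

    # Use every observed score as a candidate decision threshold.
    thresholds = np.sort(np.concatenate([target, nontarget]))

    # Miss rate: fraction of target trials scored below the threshold.
    p_miss = np.array([(target < t).mean() for t in thresholds])
    # False-alarm rate: fraction of non-target trials scored at or above it.
    p_fa = np.array([(nontarget >= t).mean() for t in thresholds])

    # The EER is the operating point where the two error rates are closest.
    idx = int(np.argmin(np.abs(p_miss - p_fa)))
    eer = (p_miss[idx] + p_fa[idx]) / 2.0
    return eer, thresholds[idx]


# Toy scores (made up): one target and one non-target fall on the wrong side.
eer, threshold = equal_error_rate([0.0, 1.0, 2.0, 3.0], [-2.0, -1.0, 0.5, 1.5])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

A sweep over observed scores is the simplest estimator; official evaluations typically interpolate the detection error trade-off curve, so values can differ slightly on small trial lists.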

Abstract

We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: using only a provided set of telephone recordings for training (fixed condition) or adding publicly available data (open condition). Under these constraints, we develop the best possible speaker embedding extractors for the predominant conversational telephone speech (CTS) domain. We explore architectures based on ResNet with different pooling mechanisms, the recently introduced ReDimNet architecture, and a system based on the XLS-R model, which represents the family of large pre-trained self-supervised models. In the open condition, we train on the VoxBlink2 dataset, which contains 110 thousand speakers across multiple languages. We observe good performance and robustness of VoxBlink-trained models, and our experiments show practical recipes for developing state-of-the-art frontends for speaker recognition.

Paper Structure

This paper contains 11 sections, 1 figure, 3 tables.

Figures (1)

  • Figure 1: DCF plots for ResNet152-VB fine-tuned on segments of different lengths.