CL-UZH submission to the NIST SRE 2024 Speaker Recognition Evaluation

Aref Farhadipour, Shiran Liu, Masoumeh Chapariniya, Valeriia Vyshnevetska, Srikanth Madikeri, Teodora Vukovic, Volker Dellwo

TL;DR

The paper tackles robust speaker recognition under the NIST SRE 2024 fixed and open conditions by leveraging multimodal cues. In the fixed condition, it combines an audio X-vector pipeline (a TDNN with Kaldi VAD and PLDA scoring, trained on the SRE21 CTS superset) with FaceNet-based visual paths for the audio-visual and visual-only trials. In the open condition, it uses networks pretrained on VoxBlink2 and VoxCeleb2 (ResNet293 for audio and an ArcFace-based FaceNet for visuals), with trial-type-specific calibration and a fusion strategy that favors audio-visual information, yielding lower EERs than either single modality. The results indicate that modality-specific architectures and calibration are practically viable for robust SRE: the systems perform well on the dev and eval sets, and their low runtime factors suggest that real-time deployment is feasible.
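The EER comparison between fused and single-modality scores mentioned above can be illustrated with a minimal sketch. The function names `compute_eer` and `fuse_scores`, the weighted-sum fusion rule, and the weight `audio_weight` are assumptions made for illustration; the summary does not state how the submitted system actually fuses or calibrates its scores.

```python
import numpy as np


def compute_eer(target_scores, nontarget_scores):
    """Approximate the equal error rate (EER) as a fraction in [0, 1].

    The EER is the operating point where the miss rate (targets rejected)
    equals the false-alarm rate (non-targets accepted).
    """
    tar = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    # Use every observed score as a candidate decision threshold.
    thresholds = np.sort(np.concatenate([tar, non]))
    # A trial is accepted when its score is >= threshold.
    miss_rate = np.array([(tar < t).mean() for t in thresholds])
    false_alarm_rate = np.array([(non >= t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(miss_rate - false_alarm_rate))
    return (miss_rate[idx] + false_alarm_rate[idx]) / 2.0


def fuse_scores(audio_scores, visual_scores, audio_weight=0.5):
    """Weighted-sum score fusion (hypothetical rule, not taken from the paper)."""
    audio = np.asarray(audio_scores, dtype=float)
    visual = np.asarray(visual_scores, dtype=float)
    return audio_weight * audio + (1.0 - audio_weight) * visual


if __name__ == "__main__":
    # Toy scores, invented purely to exercise the functions.
    audio_tar, audio_non = [1.0, 3.0], [0.0, 2.0]
    visual_tar, visual_non = [3.0, 1.0], [2.0, 0.0]
    fused_tar = fuse_scores(audio_tar, visual_tar)
    fused_non = fuse_scores(audio_non, visual_non)
    print("audio EER:", compute_eer(audio_tar, audio_non))
    print("visual EER:", compute_eer(visual_tar, visual_non))
    print("fused EER:", compute_eer(fused_tar, fused_non))
```

In this toy setup each modality makes errors on different trials, so averaging the scores separates targets from non-targets more cleanly than either modality alone. Real systems typically calibrate each score stream before fusion so that the weights are meaningful.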

Abstract

The CL-UZH team submitted one system each for the fixed and open conditions of the NIST SRE 2024 challenge. For the closed-set condition, results for the audio-only trials were obtained using an X-vector system developed with Kaldi. For the audio-visual results, we used only models developed for the visual modality. Two sets of results were submitted, one for the open-set and one for the closed-set condition: the former is based on a pretrained model using the VoxBlink2 and VoxCeleb2 datasets, while for the latter an X-vector-based model was trained from scratch using the CTS superset dataset. In addition to submitting the SRE24 evaluation results to the competition website, we discuss the performance of the proposed systems on the SRE24 evaluation in this report.

Paper Structure

This paper contains 6 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: (caption not available)