ASASVIcomtech: The Vicomtech-UGR Speech Deepfake Detection and SASV Systems for the ASVspoof5 Challenge

Juan M. Martín-Doñas, Eros Roselló, Angel M. Gomez, Aitor Álvarez, Iván López-Espejo, Antonio M. Peinado

TL;DR

This work analyzes the ASVspoof5 dataset to mitigate biases and designs both closed- and open-condition systems for spoofing detection and spoofing-aware speaker verification (SASV). The closed-condition system applies a DCCRN-based model to full-spectrum STFT features but generalizes poorly, while the open-condition systems perform strongly by ensembling self-supervised learning models (Wav2Vec2-Large, WavLM-Base) and augmenting the training data with vocoders. A Siamese-like fusion approach combines countermeasure (CM) and automatic speaker verification (ASV) outputs via calibrated log-likelihood ratios (LLRs) and non-linear fusion (LSE). It delivers competitive Track 1 minDCF/EER and Track 2 a-DCF results and is robust to a range of spoofing attacks and codecs. Calibration and diverse pretraining data prove pivotal for effective SASV integration, which suggests practical value for real-world anti-spoofing and speaker verification systems. Future work will address robustness to state-of-the-art TTS/VC, broader codec and narrowband augmentation, and adversarial defenses.
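To make the fusion step concrete, here is a minimal sketch of one way two calibrated LLRs could be combined non-linearly. It reads "LSE" as log-sum-exp and uses a soft-minimum form; both are assumptions, since the summary does not give the formula. The function name `lse_fuse` and the equal weighting of the two scores are hypothetical and not taken from the paper.

```python
import math


def lse_fuse(llr_asv: float, llr_cm: float) -> float:
    """Soft-minimum of two calibrated LLRs via a stable log-sum-exp.

    Assumed form (not confirmed by the source):
        fused = -log(exp(-llr_asv) + exp(-llr_cm))
    The fused score is high only when BOTH the speaker-verification
    and the countermeasure scores are high.
    """
    a, b = -llr_asv, -llr_cm
    m = max(a, b)  # subtract the maximum so exp() cannot overflow
    return -(m + math.log(math.exp(a - m) + math.exp(b - m)))


if __name__ == "__main__":
    # A confident speaker match with a likely spoof yields a low fused score.
    print(lse_fuse(8.0, -4.0))
    # Both subsystems accept: the fused score stays positive.
    print(lse_fuse(6.0, 5.0))
```

With this form, the fused score never exceeds the smaller of the two inputs, so a trial is accepted only if both subsystems support it. This is also why calibration matters: the two LLRs must be on a comparable scale before they are combined.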

Abstract

This paper presents the work carried out by the ASASVIcomtech team, made up of researchers from Vicomtech and the University of Granada, for the ASVspoof5 Challenge. The team participated in both Track 1 (speech deepfake detection) and Track 2 (spoofing-aware speaker verification). The work began with an analysis of the available challenge data, regarded as an essential step to avoid potential biases in the trained models; its main conclusions are presented here. Regarding the proposed approaches, a closed-condition system employing a deep complex convolutional recurrent architecture was developed for Track 1, although it did not achieve noteworthy results. In contrast, several open-condition systems, based on leveraging self-supervised models, augmented training data from previous challenges, and novel vocoders, were explored for both tracks, and an ensemble system ultimately achieved very competitive results.

Paper Structure

This paper contains 18 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Histograms of utterance duration from the training set.
  • Figure 2: Histograms of utterance duration from the development set.
  • Figure 3: Histograms of utterance delay from the training set.
  • Figure 4: Histograms of utterance delay from the development set.
  • Figure 5: Histograms of P.563 scores for training utterances across different attack types.
  • ...and 2 more figures