
Experimental Study: Enhancing Voice Spoofing Detection Models with wav2vec 2.0

Taein Kang, Soyul Han, Sunmook Choi, Jaejin Seo, Sanghyeok Chung, Seungeun Lee, Seungsang Oh, Il-Youp Kwak

TL;DR

The paper evaluates wav2vec 2.0 as a front-end feature extractor for spoofing detection, motivated by the need for more effective audio representations. It introduces two hyperparameters: $\text{TTL}$, the total number of Transformer layers used, and $\text{FTL}$, the number of frozen layers. The leftmost $\text{FTL}$ layers are frozen while the remaining $\text{TTL}-\text{FTL}$ layers are fine-tuned. Across five back-ends and multiple wav2vec 2.0 front-ends, the study reports consistent performance gains on the ASVspoof 2019 LA dataset, with the best configuration achieving $\text{EER}=0.12\%$ and $\text{t-DCF}=0.0032$ using an XLS-R(1B) front-end with RawNet2, and $\text{EER}=0.22\%$ with AASIST. The results provide actionable guidance for domain adaptation in spoofing detection and show that tailored fine-tuning of wav2vec 2.0 Transformer layers can surpass handcrafted features in robustness and accuracy.

Abstract

Conventional spoofing detection systems have heavily relied on the use of handcrafted features derived from speech data. However, a notable shift has recently emerged towards the direct utilization of raw speech waveforms, as demonstrated by methods like SincNet filters. This shift underscores the demand for more sophisticated audio sample features. Moreover, the success of deep learning models, particularly those utilizing large pretrained wav2vec 2.0 as a featurization front-end, highlights the importance of refined feature encoders. In response, this research assessed the representational capability of wav2vec 2.0 as an audio feature extractor, modifying the size of its pretrained Transformer layers through two key adjustments: (1) selecting a subset of layers starting from the leftmost one and (2) fine-tuning a portion of the selected layers from the rightmost one. We complemented this analysis with five spoofing detection back-end models, with a primary focus on AASIST, enabling us to pinpoint the optimal configuration for the selection and fine-tuning process. In contrast to conventional handcrafted features, our investigation identified several spoofing detection systems that achieve state-of-the-art performance in the ASVspoof 2019 LA dataset. This comprehensive exploration offers valuable insights into feature selection strategies, advancing the field of spoofing detection.
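
The two adjustments described above (keep a leftmost subset of the pretrained Transformer layers, then fine-tune only the rightmost part of that subset) can be sketched as a simple partition of layer indices. This is a minimal illustration and not the authors' implementation: the names `LayerPlan` and `plan_layers` are hypothetical, and the values 24, 12, and 9 are arbitrary examples rather than settings reported in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LayerPlan:
    """Hypothetical container describing what happens to each layer index."""
    frozen: List[int]      # kept in the front-end, weights held fixed
    fine_tuned: List[int]  # kept in the front-end, weights updated in training
    dropped: List[int]     # removed from the front-end entirely


def plan_layers(num_pretrained: int, ttl: int, ftl: int) -> LayerPlan:
    """Partition layer indices, where index 0 is the leftmost layer.

    ttl: number of layers kept, counted from the left (TTL).
    ftl: number of kept layers that are frozen, counted from the left (FTL).
    """
    if not 0 < ttl <= num_pretrained:
        raise ValueError("ttl must be between 1 and num_pretrained")
    if not 0 <= ftl <= ttl:
        raise ValueError("ftl must be between 0 and ttl")
    return LayerPlan(
        frozen=list(range(ftl)),                  # leftmost FTL layers
        fine_tuned=list(range(ftl, ttl)),         # remaining TTL - FTL layers
        dropped=list(range(ttl, num_pretrained))  # layers beyond TTL
    )


if __name__ == "__main__":
    # Illustrative values only (assumed, not taken from the paper).
    plan = plan_layers(num_pretrained=24, ttl=12, ftl=9)
    print("frozen:", plan.frozen)
    print("fine-tuned:", plan.fine_tuned)
    print("dropped:", len(plan.dropped), "layers")
```

In a PyTorch model, such a plan would typically be applied by truncating the encoder's layer list to the first `ttl` entries and setting `requires_grad = False` on the parameters of the frozen layers.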

Paper Structure

This paper contains 10 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Proposed framework: a pretrained wav2vec-based model serves as the featurization front-end, with the flexibility to fine-tune several Transformer layers, followed by five distinct spoofing detection back-ends.