Towards Attention-based Contrastive Learning for Audio Spoof Detection

Chirag Goel, Surya Koppisetti, Ben Colman, Ali Shahriyari, Gaurav Bharaj

TL;DR

This work investigates vision-transformer-based audio spoof detection and finds vanilla SSAST fine-tuning yields suboptimal equal error rates (EER). It introduces SSAST-CL, a two-stage contrastive learning framework with a three-branch Siamese backbone that includes a cross-attention branch, optimized by a loss $L_{con} = L_{SA} + \alpha L_{CA}$ to disentangle bonafide and spoof representations; data augmentations tailored to codec impairments further improve robustness. The approach achieves a substantial EER reduction (e.g., from $19.48$ to $4.74$ on ASVSpoof 2021 LA) and competitive performance against top models while using a smaller footprint. These results demonstrate the viability of attention-based, contrastive learning for robust audio spoof detection and suggest broader applicability to other limited-data audio tasks with codec artifacts.

Abstract

Vision transformers (ViTs) have made substantial progress on classification tasks in computer vision. Recently, Gong et al. '21 introduced attention-based modeling for several audio tasks. However, the use of a ViT for the audio spoof detection task remains relatively unexplored. We bridge this gap and introduce ViTs for this task. A vanilla baseline built on fine-tuning the SSAST (Gong et al. '22) audio ViT model achieves sub-optimal equal error rates (EERs). To improve performance, we propose a novel attention-based contrastive learning framework (SSAST-CL) that uses cross-attention to aid the representation learning. Experiments show that our framework successfully disentangles the bonafide and spoof classes and helps learn better classifiers for the task. With an appropriate data augmentation policy, a model trained with our framework achieves competitive performance on the ASVSpoof 2021 challenge. We provide comparisons and ablation studies to justify our claims.
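
Since the results are reported as equal error rates, a minimal sketch of how an EER can be computed from detection scores may help. It assumes that a higher score means "more likely bonafide" and uses a simple sweep over the observed scores as thresholds. The function name `equal_error_rate` is hypothetical, and this is not the official ASVSpoof scoring tool, which may interpolate between thresholds.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    Assumption: higher score = more likely bonafide. A trial is accepted
    as bonafide when its score is >= the threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection rate: bonafide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoof trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the operating point where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Toy scores, made up for illustration only.
    print(equal_error_rate([0.9, 0.4], [0.6, 0.1]))
```

The EER is the error rate at the threshold where false acceptances and false rejections balance, so lower is better and a single number summarizes the detector without committing to one threshold.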

Paper Structure (12 sections, 1 equation, 2 figures, 3 tables)

Figures (2)

  • Figure 1: SSAST-CL: A two-stage contrastive learning framework to train the SSAST model (Gong et al. '22) for audio spoof detection. In Stage I, we employ Siamese training with weight sharing across two multi-head self-attention (MH-SA) branches and one multi-head cross-attention (MH-CA) branch. Model weights are learned using a contrastive loss that measures the (dis-)similarity between the self- and cross-attention representations $(\mathbf{r}_1^{SA}, \mathbf{r}_2^{SA}, \mathbf{r}_{12}^{CA})$. In Stage II, an MLP classifies the learned representations as bonafide or spoof.
  • Figure 2: t-SNE embeddings for (a) the vanilla WCE baseline (Gong et al. '22) and (b-c) the proposed SSAST-CL solution on the ASVSpoof 2021 dataset.
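
The Stage I objective combines a self-attention term and a cross-attention term as $L_{con} = L_{SA} + \alpha L_{CA}$. The sketch below illustrates only that weighted sum over the three representations named in Figure 1. The exact form of each term is not given in this summary, so `pair_loss` is an assumption: a generic cosine-similarity contrastive term that pulls same-class pairs together and pushes different-class pairs apart. Pairing $\mathbf{r}_1^{SA}$ with $\mathbf{r}_{12}^{CA}$ in the cross-attention term, the function names, and the default `alpha` are likewise hypothetical.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def pair_loss(u, v, same_class):
    """Hypothetical pairwise term (an assumption, not the paper's definition).

    Same-class pairs are penalized for low similarity; different-class
    pairs are penalized for positive similarity.
    """
    sim = cosine_similarity(u, v)
    return 1.0 - sim if same_class else max(0.0, sim)


def contrastive_loss(r1_sa, r2_sa, r12_ca, same_class, alpha=1.0):
    """L_con = L_SA + alpha * L_CA, with assumed forms for both terms."""
    l_sa = pair_loss(r1_sa, r2_sa, same_class)   # self-attention branches
    l_ca = pair_loss(r1_sa, r12_ca, same_class)  # assumed cross-attention pairing
    return l_sa + alpha * l_ca


if __name__ == "__main__":
    # Toy 2-D representations, made up for illustration only.
    print(contrastive_loss([1.0, 0.0], [0.0, 1.0], [0.0, 1.0], same_class=False))
```

In this sketch, `alpha` controls how strongly the cross-attention representation influences training relative to the two self-attention representations; the value used by the authors is not stated in this summary.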