One-Class Learning with Adaptive Centroid Shift for Audio Deepfake Detection

Hyun Myung Kim, Kangwook Jang, Hoirin Kim

TL;DR

A novel adaptive centroid shift (ACS) method that continually updates the centroid representation as a weighted average of bonafide representations, outperforming all existing systems on the ASVspoof 2021 deepfake dataset.

Abstract

As speech synthesis systems have made remarkable advances in recent years, robust deepfake detection systems that perform well against unseen spoofing systems have become increasingly important. In this paper, we propose a novel adaptive centroid shift (ACS) method that continually updates the centroid representation as a weighted average of bonafide representations. Our approach uses only bonafide samples to define the centroid, which yields a centroid specialized for one-class learning. Integrating ACS with one-class learning gathers bonafide representations into a single cluster, forming well-separated embeddings that are robust to unseen spoofing attacks. Our proposed method achieves an equal error rate (EER) of 2.19% on the ASVspoof 2021 deepfake dataset, outperforming all existing systems. Furthermore, a t-SNE visualization illustrates that our method effectively maps the bonafide embeddings into a single cluster and successfully disentangles the bonafide and spoof classes.
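The centroid update described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact weighting, so this sketch assumes a simple count-weighted running mean: the previous centroid is weighted by the number of bonafide samples seen so far, and each new bonafide embedding gets weight one. The class name `AdaptiveCentroid`, the helper `cosine_score`, and the use of cosine similarity for scoring are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


class AdaptiveCentroid:
    """Running centroid built from bonafide embeddings only (assumed weighting)."""

    def __init__(self, dim):
        self.centroid = np.zeros(dim)
        self.count = 0  # number of bonafide embeddings seen so far

    def update(self, embeddings, is_bonafide):
        """Shift the centroid using the bonafide rows of one minibatch.

        embeddings: array of shape (batch, dim)
        is_bonafide: boolean array of shape (batch,)
        """
        bona = embeddings[is_bonafide]
        k = len(bona)
        if k == 0:
            # No bonafide sample in this minibatch: the centroid stays put.
            return self.centroid
        total = self.count + k
        # Weighted average of the previous centroid and the new bonafide rows.
        self.centroid = (self.count * self.centroid + bona.sum(axis=0)) / total
        self.count = total
        return self.centroid


def cosine_score(embedding, centroid, eps=1e-12):
    """Cosine similarity to the centroid; higher suggests 'more bonafide'."""
    denom = np.linalg.norm(embedding) * np.linalg.norm(centroid) + eps
    return float(embedding @ centroid / denom)
```

Spoofed rows are masked out before the update, so only bonafide samples ever move the centroid. This matches the one-class framing of the abstract, though the actual loss and weighting used in the paper may differ.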

Paper Structure

This paper contains 14 sections, 6 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Illustration of the ACS method when the $(i+1)$-th minibatch is the input and each minibatch contains one bonafide sample. Dashed elements show the previous states, and arrows represent the optimization movements.
  • Figure 2: The pipeline of our proposed model. The blue boxes indicate the ASP module. The frame-level feature is extracted from the XLS-R feature encoder, and the utterance-level feature is obtained through the ASP module.
  • Figure 3: Visualization of the embedding space using t-SNE on the ASVspoof 2021 LA evaluation dataset.
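
The step from frame-level to utterance-level features in Figure 2 can be sketched as follows. The listing does not expand "ASP"; this sketch assumes it denotes attentive statistics pooling, in which per-frame attention weights produce a weighted mean and a weighted standard deviation that are concatenated. The function name and the parameters `w` and `v` are hypothetical, and the random frames stand in for XLS-R encoder outputs.

```python
import numpy as np


def attentive_statistics_pooling(frames, w, v):
    """Pool frame-level features (T, D) into one utterance-level vector (2*D,).

    w: projection matrix of shape (D, H), v: scoring vector of shape (H,).
    Both are hypothetical attention parameters that would be learned in practice.
    """
    scores = np.tanh(frames @ w) @ v           # one scalar score per frame, (T,)
    scores = scores - scores.max()             # stabilize the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()  # attention weights, sum to 1
    mean = alpha @ frames                      # attention-weighted mean, (D,)
    var = alpha @ (frames ** 2) - mean ** 2    # attention-weighted variance
    std = np.sqrt(np.clip(var, 1e-9, None))    # clip to avoid sqrt of negatives
    return np.concatenate([mean, std])


# Usage with stand-in data: 50 frames of 8-dimensional features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 8))
utterance = attentive_statistics_pooling(
    frames, rng.normal(size=(8, 4)), rng.normal(size=4)
)
```

With all-zero attention parameters the weights become uniform, and the output reduces to the plain per-dimension mean and standard deviation over frames.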