
Exploring Green AI for Audio Deepfake Detection

Subhajit Saha, Md Sahidullah, Swagatam Das

TL;DR

This study presents a novel framework for audio deepfake detection that can be trained seamlessly on standard CPU resources. It applies classical machine learning algorithms, such as logistic regression and shallow neural networks, to embeddings extracted from a pre-trained SSL model.

Abstract

State-of-the-art audio deepfake detectors built on deep neural networks achieve impressive recognition performance. This advantage, however, comes with a significant carbon footprint, mainly due to the use of high-performance computing with accelerators and long training times. Studies show that an average deep NLP model produces around 626k lbs of CO₂, roughly five times the lifetime emissions of an average US car. This is a serious threat to the environment. To tackle this challenge, this study presents a novel framework for audio deepfake detection that can be seamlessly trained using standard CPU resources. The proposed framework uses off-the-shelf self-supervised learning (SSL) models that are pre-trained and available in public repositories. In contrast to existing methods that fine-tune SSL models and employ additional deep neural networks for downstream tasks, we apply classical machine learning algorithms, such as logistic regression and shallow neural networks, to embeddings extracted from the pre-trained SSL model. Our approach shows competitive results compared to commonly used high-carbon-footprint approaches. In experiments on the ASVspoof 2019 LA dataset, we achieve a 0.90% equal error rate (EER) with fewer than 1k trainable model parameters. To encourage further research in this direction and support reproducible results, the Python code will be made publicly accessible following acceptance. GitHub: https://github.com/sahasubhajit/Speech-Spoofing-
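
The idea in the abstract, a small classical classifier trained on fixed embeddings and scored by EER, can be illustrated with a minimal CPU-only sketch. This is not the authors' implementation: random Gaussian vectors stand in for SSL embeddings, the 8-dimensional size is chosen only to keep the run fast, scores are assumed to be higher for bonafide speech, and the function names (`train_logistic_regression`, `compute_eer`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    # Linear score followed by the logistic function (higher = more bonafide).
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic_regression(X, y, epochs=200, lr=0.1):
    """Fit weights and bias by full-batch gradient descent on the log loss."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = score(w, b, xi) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def compute_eer(bonafide_scores, spoof_scores):
    """Error rate at the threshold where false acceptance and false rejection are closest."""
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(bonafide_scores) | set(spoof_scores)):
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)        # spoof accepted
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)   # bonafide rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def synthetic_embeddings(rng, n, dim, mean):
    # Stand-in for embeddings from a frozen pre-trained SSL model (assumption).
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
DIM = 8  # illustrative only; real SSL embeddings are much larger

# Label 1 = bonafide, label 0 = spoof.
X_train = synthetic_embeddings(rng, 100, DIM, 1.0) + synthetic_embeddings(rng, 100, DIM, -1.0)
y_train = [1] * 100 + [0] * 100
w, b = train_logistic_regression(X_train, y_train)

bonafide_test = [score(w, b, x) for x in synthetic_embeddings(rng, 100, DIM, 1.0)]
spoof_test = [score(w, b, x) for x in synthetic_embeddings(rng, 100, DIM, -1.0)]

n_params = len(w) + 1  # one weight per embedding dimension plus a bias
eer = compute_eer(bonafide_test, spoof_test)
print(f"trainable parameters: {n_params}, EER: {100 * eer:.2f}%")
```

A logistic regression has one weight per embedding dimension plus a bias, so its size grows only linearly with the embedding width. If the embeddings were 768-dimensional (an assumption, not stated above), such a classifier would have 769 trainable parameters, which is in line with the "fewer than 1k" figure.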

Paper Structure (20 sections, 1 equation, 4 figures, 2 tables, 1 algorithm)

This paper contains 20 sections, 1 equation, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: Schematic diagram of the proposed framework illustrating SSL training (in red) and downstream training (in green).
  • Figure 2: t-SNE plots of embeddings from the feature encoder output (left) and the 12th layer, i.e. the last transformer layer (right), computed on the training set of the ASVspoof 2019 LA subset, which contains both bonafide and spoof classes.
  • Figure 3: Boxplots of EERs (left) and F1 scores (right) for the six methods. Each boxplot summarizes the EERs computed across 13 different layers.
  • Figure 4: Boxplots of EERs and F1 scores for each layer. Each boxplot summarizes the EERs computed across six downstream classifiers.