Toward Robust Real-World Audio Deepfake Detection: Closing the Explainability Gap

Georgia Channing, Juil Sock, Ronald Clark, Philip Torr, Christian Schroeder de Witt

TL;DR

We introduce novel explainability methods for state-of-the-art transformer-based audio deepfake detectors and open-source a novel benchmark for real-world generalizability.

Abstract

The rapid proliferation of AI-manipulated or generated audio deepfakes poses serious challenges to media integrity and election security. Current AI-driven detection solutions lack explainability and underperform in real-world settings. In this paper, we introduce novel explainability methods for state-of-the-art transformer-based audio deepfake detectors and open-source a novel benchmark for real-world generalizability. By narrowing the explainability gap between transformer-based audio deepfake detectors and traditional methods, our results not only build trust with human experts, but also pave the way for unlocking the potential of citizen intelligence to overcome the scalability issue in audio deepfake detection.

Paper Structure

This paper contains 41 sections, 11 equations, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: Diagram of the audio spectrogram transformer architecture introduced by Gong et al. (2021).
  • Figure 2: GBDT feature importances as measured by mean accuracy decrease, with standard deviations, for the $6.0$-second classifier.
  • Figure 3: GBDT feature correlations and clusters for the $6.0$-second classifier. In these figures, sc refers to the spectral centroid, sb refers to the spectral bandwidth, cr refers to the zero-crossing rate (ZCR), mfcc$i$ refers to the $i$-th MFCC feature, and chroma$i$ refers to the $i$-th chroma feature.
  • Figure 4: Importance measured by occlusion for $6.0$-second audio samples.
  • Figure 5: Distribution of attention for $6.0$-second audio samples.
  • ...and 5 more figures