Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception

Yuankun Xie, Ruibo Fu, Zhiyong Wang, Xiaopeng Wang, Songjun Cao, Long Ma, Haonan Cheng, Long Ye

TL;DR

This work tackles all-type audio deepfake detection by establishing a comprehensive cross-type benchmark spanning speech, sound, singing, and music, and proposes prompt-based SSL countermeasures to enable universal detection. It introduces PT-SSL-AASIST, which learns task-specific prompts while freezing the SSL backbone, and WPT-SSL-AASIST, which injects wavelet-derived tokens to capture full-frequency deepfake cues without increasing trainable parameters. Co-training across all types yields strong universal performance, with WPT-XLSR-AASIST achieving an average EER of 3.58% on the all-type benchmark, and the analysis reveals type-invariant representations and frequency-focused attention patterns. Overall, the combination of wavelet prompts and co-training provides a practical, efficient pathway to robust all-type ADD in real-world scenarios where audio type is uncertain.
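
The wavelet-prompt idea summarized above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the single-level 1D Haar transform, the choice to transform along the feature dimension, and the names `haar_dwt_1d`, `build_prompt_sequence`, and `num_wavelet` are all assumptions made for illustration. What it shows is only the general mechanism: a subset of the prompt tokens is passed through a DWT before being prepended to the audio tokens, and because the transform is fixed, it adds no trainable parameters.

```python
import math
import random

def haar_dwt_1d(x):
    """Single-level 1D Haar DWT: returns (approximation, detail) coefficients."""
    if len(x) % 2 != 0:
        raise ValueError("length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) * s for i in range(0, len(x), 2)]  # low band
    detail = [(x[i] - x[i + 1]) * s for i in range(0, len(x), 2)]  # high band
    return approx, detail

def build_prompt_sequence(prompts, num_wavelet, audio_tokens):
    """Prepend prompt tokens to audio tokens (hypothetical helper).

    The last `num_wavelet` prompts are replaced by their Haar coefficients
    (low band followed by high band). The output length per token is
    unchanged, so the number of learnable values stays the same.
    """
    split = len(prompts) - num_wavelet
    plain = prompts[:split]
    wavelet = []
    for p in prompts[split:]:
        approx, detail = haar_dwt_1d(p)
        wavelet.append(approx + detail)
    return plain + wavelet + audio_tokens

random.seed(0)
dim = 8  # toy feature size; real SSL models use far larger dimensions
# Assumption: these are the only learnable values; the backbone stays frozen.
prompts = [[random.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(4)]
# Stand-in for token features produced by the frozen SSL front-end.
audio_tokens = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(5)]

sequence = build_prompt_sequence(prompts, num_wavelet=2, audio_tokens=audio_tokens)
print(len(sequence), len(sequence[0]))
```

In a real system the concatenated sequence would be fed to the frozen transformer layers, and gradients would flow only into the prompt values through the fixed transform.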

Abstract

The rapid advancement of audio generation technologies has escalated the risks of malicious deepfake audio across speech, sound, singing voice, and music, threatening multimedia security and trust. While existing countermeasures (CMs) perform well in single-type audio deepfake detection (ADD), their performance declines in cross-type scenarios. This paper is dedicated to studying the all-type ADD task. We are the first to comprehensively establish an all-type ADD benchmark to evaluate current CMs, incorporating cross-type deepfake detection across speech, sound, singing voice, and music. Then, we introduce the prompt tuning self-supervised learning (PT-SSL) training paradigm, which optimizes the SSL front-end by learning specialized prompt tokens for ADD, requiring 458x fewer trainable parameters than fine-tuning (FT). Considering the auditory perception of different audio types, we propose the wavelet prompt tuning (WPT)-SSL method to capture type-invariant auditory deepfake information from the frequency domain without requiring additional trainable parameters, thereby enhancing performance over FT in the all-type ADD task. To achieve a universal CM, we utilize all types of deepfake audio for co-training. Experimental results demonstrate that WPT-XLSR-AASIST achieved the best performance, with an average EER of 3.58% across all evaluation sets.
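
Since the headline result is reported as an equal error rate (EER), a minimal sketch of how that metric is commonly computed may help. The scores below are made-up toy values, the function name `compute_eer` is hypothetical, and the convention that a higher score means "more likely real" is an assumption; the paper's own evaluation scripts may differ in detail. The EER is the operating point where the false acceptance rate (fake accepted as real) equals the false rejection rate (real rejected as fake).

```python
def compute_eer(real_scores, fake_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Assumed convention: a higher score means 'more likely real'.
    """
    thresholds = sorted(set(real_scores) | set(fake_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: real audio scored below the threshold.
        frr = sum(s < t for s in real_scores) / len(real_scores)
        # False acceptance: fake audio scored at or above the threshold.
        far = sum(s >= t for s in fake_scores) / len(fake_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

# Toy scores for one evaluation set (illustrative values only).
real = [0.9, 0.8, 0.7, 0.4]
fake = [0.1, 0.2, 0.3, 0.6]
print(compute_eer(real, fake))  # 0.25
```

An average EER across evaluation sets, as quoted in the abstract, would then be the mean of the per-set values obtained this way.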

Paper Structure

This paper contains 16 sections, 5 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: The challenge that current single-type-trained CMs face in the cross-type ADD task, highlighting the effectiveness of our proposed WPT-SSL CM.
  • Figure 2: Our proposed PT-SSL-AASIST (left) and WPT-SSL-AASIST (right). The differences between PT and WPT are illustrated below. WPT enhances the full-frequency perception of SSL-AASIST by applying DWT to part of the prompt tokens.
  • Figure 3: Different paradigms of PT-SSL-AASIST.
  • Figure 4: Convergence speed of different paradigms.
  • Figure 5: t-SNE visualization for FT-XLSR-AASIST (left) and WPT-XLSR-AASIST (right). Different colors indicate features from different types: blue=speech, green=sound, orange=singing, purple=music. Different shapes represent different categories: cross=real, point=fake.
  • ...and 1 more figure