SafeEar: Content Privacy-Preserving Audio Deepfake Detection

Xinfeng Li, Kai Li, Yifan Zheng, Chen Yan, Xiaoyu Ji, Wenyuan Xu

TL;DR

SafeEar tackles the privacy risk of audio deepfake detection by decoupling speech into semantic and acoustic tokens and performing detection on acoustic-only representations. It introduces a neural codec-based decoupling model (CDM) with HuBERT-guided semantic quantization and a shuffled acoustic token stream, augmented by real-world codecs to bridge training and deployment gaps. The approach yields competitive deepfake detection accuracy ($EER$ as low as $2.02\%$ on multilingual data) while achieving robust content protection, with $WER$ above $93.93\%$ and STOI near zero against content-recovery attempts. The work provides a practical benchmark (CVoiceFake) and demonstrates a pathway toward privacy-preserving audio analytics suitable for local or third-party deployment.

Abstract

Text-to-Speech (TTS) and Voice Conversion (VC) models have exhibited remarkable performance in generating realistic and natural audio. However, their dark side, audio deepfakes, poses a significant threat to both society and individuals. Existing countermeasures largely focus on determining the genuineness of speech from complete original audio recordings, which often contain private content. This oversight may keep deepfake detection out of many applications, particularly in scenarios involving sensitive information such as business secrets. In this paper, we propose SafeEar, a novel framework that aims to detect deepfake audio without accessing the speech content within. Our key idea is to adapt a neural audio codec into a novel decoupling model that cleanly separates the semantic and acoustic information in audio samples, and to use only the acoustic information (e.g., prosody and timbre) for deepfake detection. In this way, no semantic content is exposed to the detector. To overcome the challenge of identifying diverse deepfake audio without semantic clues, we enhance our deepfake detector with real-world codec augmentation. Extensive experiments conducted on four benchmark datasets demonstrate SafeEar's effectiveness in detecting various deepfake techniques, with an equal error rate (EER) down to 2.02%. Simultaneously, it shields speech content in five languages from being deciphered by both machine and human auditory analysis, as demonstrated by word error rates (WERs) all above 93.93% and by our user study. Furthermore, the benchmark we constructed for anti-deepfake and anti-content-recovery evaluation provides a basis for future research on audio privacy preservation and deepfake detection.
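The two headline metrics in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the score convention (higher score means more likely bona fide), and the toy inputs are assumptions made for illustration. EER is the operating point where the false acceptance rate and the false rejection rate coincide, and WER is the word-level edit distance divided by the reference length, so a WER near 100% means a transcript recovered from the protected representation shares almost nothing with the original.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER by sweeping every observed score as a threshold.

    Assumption: a higher score means "more likely bona fide".
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: bona fide audio scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: the reference contains at least one word.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Toy detector scores, invented for illustration only.
    print(equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6]))
    # Toy transcripts: every word of the reference is wrong in the hypothesis.
    print(word_error_rate("the cat sat", "a dog ran"))
```

Under these definitions a low EER and a high WER pull in the same direction for SafeEar: the detector still separates real from fake audio, while a speech recognizer run on the protected representation fails to recover the spoken words.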

Paper Structure

This paper contains 42 sections, 10 equations, 9 figures, and 12 tables.

Figures (9)

  • Figure 1: SafeEar framework decouples speech samples into semantic and acoustic information. By using acoustic-only information, SafeEar achieves reliable deepfake detection while protecting user content privacy from recovery attacks.
  • Figure 2: Mainstream solutions on audio deepfake detection: pipeline and end-to-end detector.
  • Figure 3: Overview of the SafeEar framework. In the inference phase, component ④ is simply removed.
  • Figure 4: Frontend codec-based decoupling model (①) of SafeEar.
  • Figure 5: Bottleneck & Shuffle layers (②) of SafeEar.
  • ...and 4 more figures