Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes

Zhou Feng, Jiahao Chen, Chunyi Zhou, Yuwen Pu, Qingming Li, Tianyu Du, Shouling Ji

TL;DR

Enkidu tackles the privacy threat posed by voice deepfakes by introducing universal frequential perturbations learned with few-shot data. The method uses a two-stage approach: stage I optimizes a compact complex-valued perturbation in the frequency domain, while stage II deploys it in real time via a lightweight Tiler that tiles the perturbation over spectrogram frames, enabling length-agnostic privacy protection with minimal perceptual distortion. Across six TTS and five ASV models and multilingual datasets, Enkidu achieves strong SPR and DPR with high intelligibility (MOS ~3.01, STOI ~0.71) and real-time performance (RTC < 0.001, memory < 70 MB, Tiler ~4 MB). The framework demonstrates robust transferability, black-box applicability, and practicality for edge devices, representing a new state-of-the-art in universal, efficient audio privacy defense against voice deepfakes.

Abstract

The rapid advancement of voice deepfake technologies has raised serious concerns about user audio privacy, as attackers increasingly exploit publicly available voice data to generate convincing fake audio for malicious purposes such as identity theft, financial fraud, and misinformation campaigns. While existing defense methods offer partial protection, they face critical limitations, including weak adaptability to unseen user data, poor scalability to long audio, rigid reliance on white-box knowledge, and high computational and temporal costs during the encryption process. To address these challenges and defend against personalized voice deepfake threats, we propose Enkidu, a novel user-oriented privacy-preserving framework that leverages universal frequential perturbations generated through black-box knowledge and few-shot training on a small amount of user data. These highly malleable frequency-domain noise patches enable real-time, lightweight protection with strong generalization across variable-length audio and robust resistance to voice deepfake attacks, all while preserving perceptual quality and speech intelligibility. Notably, Enkidu is 50 to 200 times more memory-efficient (processing memory as low as 0.004 gigabytes) and 3 to 7000 times faster at runtime (real-time coefficient as low as 0.004) than six state-of-the-art countermeasures. Extensive experiments across six mainstream text-to-speech models and five cutting-edge automated speaker verification models demonstrate the effectiveness, transferability, and practicality of Enkidu in defending against both vanilla and adaptive voice deepfake attacks. Our code is currently available.
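
The length-agnostic behaviour described above comes from reusing one fixed-size frequency-domain patch across however many spectrogram frames the audio has. The following is a minimal sketch of that tiling idea only, not the authors' Tiler: the function name `tile_perturbation`, the `scale` parameter, the wrap-around rule, and the toy values are all assumptions made for illustration, and the spectrogram is represented as a plain list of complex-valued frames rather than the output of a real STFT.

```python
from typing import List

Frame = List[complex]  # one spectrogram frame: one complex value per frequency bin


def tile_perturbation(spec: List[Frame], patch: List[Frame], scale: float = 1.0) -> List[Frame]:
    """Add a fixed-size complex patch to a spectrogram of any length.

    Hypothetical helper: frame t of the spectrogram receives patch frame
    t mod P, where P is the number of frames in the patch.
    """
    if not patch:
        raise ValueError("patch must contain at least one frame")
    n_bins = len(patch[0])
    out: List[Frame] = []
    for t, frame in enumerate(spec):
        if len(frame) != n_bins:
            raise ValueError("frame and patch must have the same number of frequency bins")
        p = patch[t % len(patch)]  # wrap around so the patch repeats along time
        out.append([x + scale * d for x, d in zip(frame, p)])
    return out


# Toy data (assumed values): a 2-frame patch over 3 frequency bins,
# applied to spectrograms with 5 and 7 frames.
patch = [[0.1 + 0.2j, 0.0 + 0.0j, -0.1j], [0.05 + 0j, 0.2j, 0.1 + 0.1j]]
short = [[complex(1, 0)] * 3 for _ in range(5)]
long = [[complex(1, 0)] * 3 for _ in range(7)]

protected_short = tile_perturbation(short, patch)
protected_long = tile_perturbation(long, patch)
```

Because the patch size is independent of the input length, the same stored perturbation serves both inputs and the per-frame cost stays constant. In a real pipeline the frames would come from an STFT of the waveform, and the perturbed spectrogram would be inverted back to audio.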

Paper Structure

This paper contains 51 sections, 14 equations, 7 figures, 7 tables, 2 algorithms.

Figures (7)

  • Figure 1: The real-life threat of voice deepfake misuse.
  • Figure 2: Threat model. Enkidu generates an optimized UFP and attaches it to user audio in real time. The protected audio maintains naturalness for human listeners and transcription accuracy, while degrading the performance of malicious TTS-based voice mimicry, thus preventing misuse.
  • Figure 3: UFP efficiency analysis across varying audio durations (1–100 seconds at 16 kHz).
  • Figure 4: Ablation results under different Frame Length settings across ASV models. Both SPR (bar) and DPR (line) are visualized to highlight trade-offs in temporal perturbation granularity.
  • Figure 5: Ablation results under different Train Ratios. Even with limited training data, Enkidu achieves strong SPR/DPR, and performance scales consistently with increased data availability.
  • ...and 2 more figures