CLAD: Robust Audio Deepfake Detection Against Manipulation Attacks with Contrastive Learning

Haolin Wu, Jing Chen, Ruiying Du, Cong Wu, Kun He, Xingcan Shang, Hao Ren, Guowen Xu

TL;DR

This work analyzes how simple audio manipulations degrade state-of-the-art deepfake detectors and introduces CLAD, a robust detector built on MoCo-style contrastive learning augmented with a length-based loss. CLAD trains an audio encoder to produce representations that stay invariant under manipulation and clusters real audio samples to improve downstream discrimination, achieving false acceptance rates (FARs) below 1.63% across manipulation types and as low as 0.12% under white-noise conditions. The approach is detector-agnostic and can be plugged into existing systems, and extensive ablations confirm the importance of both the contrastive learning and the length loss. The findings suggest practical ways to harden audio deepfake detection in real-world settings and give the community an effective, reusable defense tool along with open-source code.
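The two training signals named above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `info_nce` is the standard InfoNCE objective used by MoCo-style methods (one positive key, a set of negative keys, a temperature), while `length_loss` is a hypothetical form of a length-based term that shrinks the feature norm of real audio and keeps fake features at least `margin` away from the origin. The source only states that the length loss clusters real audio more closely in feature space; the exact formula, the `margin` value, and all function names here are assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss for a single query: -log softmax of the positive logit.

    In a MoCo-style setup the query and positive key would be two manipulated
    views of the same clip, and the negatives would come from a queue.
    """
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive_key)) / temperature]
    logits += [dot(q, l2_normalize(k)) / temperature for k in negative_keys]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


def length_loss(feature, is_real, margin=4.0):
    """Hypothetical length term (assumed form, not taken from the paper).

    Real clips are pulled toward the origin; fake clips are penalized only
    when their feature norm falls below the margin.
    """
    length = math.sqrt(sum(x * x for x in feature))
    if is_real:
        return length
    return max(0.0, margin - length)
```

Under these assumptions, a pretraining step would sum `info_nce` over a batch and add a weighted `length_loss`; the weighting is likewise a free choice in this sketch.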

Abstract

The increasing prevalence of audio deepfakes poses significant security threats, necessitating robust detection methods. While existing detection systems exhibit promise, their robustness against malicious audio manipulations remains underexplored. To bridge the gap, we undertake the first comprehensive study of the susceptibility of the most widely adopted audio deepfake detectors to manipulation attacks. Surprisingly, even manipulations like volume control can significantly bypass detection without affecting human perception. To address this, we propose CLAD (Contrastive Learning-based Audio deepfake Detector) to enhance the robustness against manipulation attacks. The key idea is to incorporate contrastive learning to minimize the variations introduced by manipulations, thereby enhancing detection robustness. Additionally, we incorporate a length loss, aiming to improve the detection accuracy by clustering real audios more closely in the feature space. We comprehensively evaluated the most widely adopted audio deepfake detection models and our proposed CLAD against various manipulation attacks. The detection models exhibited vulnerabilities, with FAR rising to 36.69%, 31.23%, and 51.28% under volume control, fading, and noise injection, respectively. CLAD enhanced robustness, reducing the FAR to 0.81% under noise injection and consistently maintaining an FAR below 1.63% across all tests. Our source code and documentation are available in the artifact repository (https://github.com/CLAD23/CLAD).
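To make the three manipulations and the reported metric concrete, here is a small sketch that operates on a clip stored as a list of float samples. It assumes FAR denotes the false acceptance rate, that is, the share of fake clips the detector accepts as real, and that a higher score means "more likely real". The function names, the linear fade shape, and the SNR-based noise level are illustrative assumptions, not the paper's exact attack parameters.

```python
import math
import random


def volume_control(samples, gain):
    """Scale every sample by a constant gain factor."""
    return [gain * s for s in samples]


def fade_in(samples):
    """Apply a linear fade-in ramp from 0 to 1 across the clip."""
    n = len(samples)
    if n < 2:
        return list(samples)
    return [s * (i / (n - 1)) for i, s in enumerate(samples)]


def inject_white_noise(samples, snr_db, seed=0):
    """Add Gaussian white noise at a target signal-to-noise ratio (dB)."""
    rng = random.Random(seed)
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def false_acceptance_rate(scores, is_fake, threshold):
    """Fraction of fake clips scored at or above the threshold (accepted as real)."""
    fake_scores = [s for s, fake in zip(scores, is_fake) if fake]
    if not fake_scores:
        raise ValueError("need at least one fake clip")
    accepted = sum(1 for s in fake_scores if s >= threshold)
    return accepted / len(fake_scores)
```

A robustness check in this spirit would score the same fake clips before and after each manipulation and compare the two FAR values at a fixed threshold.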

Paper Structure

This paper contains 22 sections, 5 equations, 9 figures, and 6 tables.

Figures (9)

  • Figure 1: Illustration of the manipulation attacks. Although the detection model performs well on the original audio, simple manipulations such as fading can bypass it.
  • Figure 2: t-SNE visualization of original vs. manipulated features extracted by widely adopted detection models. (a) and (d) are features extracted by AASIST, (b) and (e) by RawNet2, and (c) and (f) by Res-TSSDNet.
  • Figure 3: Overview of CLAD. (a) illustrates the pretraining stage. (b) illustrates the downstream training stage.
  • Figure 4: Illustration of the motivation for the length loss. (a) illustrates training with the contrastive loss. (b) illustrates the features extracted by an encoder trained with the contrastive loss only. (c) illustrates the features extracted by an encoder trained with both the length loss and the contrastive loss.
  • Figure 5: The prediction score distribution of baseline model output for the whole dataset under various types of manipulations.
  • ...and 4 more figures