
Retrieval-Augmented Audio Deepfake Detection

Zuheng Kang, Yayun He, Botao Zhao, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang

TL;DR

Retrieval-Augmented Detection (RAD) addresses the vulnerability of relying on a single model for audio deepfake detection by augmenting test samples with retrieved bonafide examples. The method combines WavLM-based self-supervised features with a retrieval module and a multi-fusion attentive classifier (MFA), forming RAD-MFA that leverages external evidence to improve decision-making. Experiments on ASVspoof 2019 LA and ASVspoof 2021 LA/DF show state-of-the-art performance on the DF subset and competitive results on LA, with ablations highlighting the contributions of RAD, data augmentation (VCTK), fine-tuning, and layer-wise retrieval. The approach also demonstrates interpretability through speaker-consistent retrieval and offers a scalable, knowledge-augmented paradigm for robust audio deepfake detection.

Abstract

With recent advances in speech synthesis including text-to-speech (TTS) and voice conversion (VC) systems enabling the generation of ultra-realistic audio deepfakes, there is growing concern about their potential misuse. However, most deepfake (DF) detection methods rely solely on the fuzzy knowledge learned by a single model, resulting in performance bottlenecks and transparency issues. Inspired by retrieval-augmented generation (RAG), we propose a retrieval-augmented detection (RAD) framework that augments test samples with similar retrieved samples for enhanced detection. We also extend the multi-fusion attentive classifier to integrate it with our proposed RAD framework. Extensive experiments show the superior performance of the proposed RAD framework over baseline methods, achieving state-of-the-art results on the ASVspoof 2021 DF set and competitive results on the 2019 and 2021 LA sets. Further sample analysis indicates that the retriever consistently retrieves samples mostly from the same speaker with acoustic characteristics highly consistent with the query audio, thereby improving detection performance.
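
The retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each audio sample has already been reduced to a fixed-length feature vector (for example by a self-supervised model such as WavLM), that the database holds bonafide feature vectors, and that similarity is measured with cosine similarity. The function names `cosine_similarity`, `retrieve_top_k`, and `augment_query` are hypothetical, and how the retrieved samples are fused with the query is left to the downstream classifier.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query: Sequence[float],
    database: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs for the k database vectors closest to the query."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(database)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def augment_query(
    query: Sequence[float],
    database: Sequence[Sequence[float]],
    k: int,
) -> Dict[str, object]:
    """Pair the query with its retrieved neighbours (assumed input format for a classifier)."""
    hits = retrieve_top_k(query, database, k)
    return {
        "query": list(query),
        "retrieved": [list(database[i]) for i, _ in hits],
        "similarities": [score for _, score in hits],
    }
```

A classifier would then receive both the query features and the retrieved bonafide features, so that its decision can draw on external evidence rather than only on what a single model has learned.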

Paper Structure (16 sections, 3 equations, 6 figures, 4 tables)

This paper contains 16 sections, 3 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of traditional frameworks and our proposed framework for audio deepfake detection. (1) shows the pipeline framework. (2) shows the end-to-end framework. (ours) shows our proposed retrieval-augmented detection (RAD) framework.
  • Figure 2: The baseline structure for fine-tuning.
  • Figure 3: Overview of the RAG and RAD pipelines. Triangular-edged rectangles represent vectors for retrieval databases. In RAG, long rectangles represent document chunks. In RAD, long rectangles with/without an outline represent long/short features, and rounded-edge rectangles represent audio segments.
  • Figure 4: Properties of RAG, RAD, and full training/fine-tuning for detection. Red text represents the focused attention, and green cells represent ideas that should be verified in this paper.
  • Figure 5: The structure of the detection model architecture. $\oplus$ denotes concatenation. This figure illustrates in detail the $3^\mathrm{rd}$ stage (get results) of the RAD pipeline in Figure 3.
  • ...and 1 more figures