Zero-Day Audio DeepFake Detection via Retrieval Augmentation and Profile Matching

Xuechen Liu, Xin Wang, Junichi Yamagishi

TL;DR

Zero-day audio deepfake attacks undermine detectors trained on known data. The authors present a training-free retrieval-augmentation framework that builds a knowledge database from SSL-based countermeasure (CM) features and speaker profile attributes, then uses k-NN retrieval and ensembling (MV, Ratio, Avg) to detect unseen fakes without fine-tuning. On DeepFake-Eval-2024 (DE2024), across multiple durations, the approach yields results competitive with supervised baselines and shows that voice-quality cues are particularly informative, while cross-database tests (DE2024 vs AI4T) expose domain-mismatch challenges for training-free methods. The work demonstrates the practical potential of rapid, training-free defense against evolving audio deepfakes and highlights avenues for improving robustness across domains and temporal features.

Abstract

Modern audio deepfake detectors built on foundation models and large training datasets achieve promising detection performance. However, they struggle with zero-day attacks, where the audio samples are generated by novel synthesis methods that the models have not seen in their training data. Conventional approaches fine-tune the detector, which can be problematic when a prompt response is needed. This paper proposes a training-free retrieval-augmented framework for zero-day audio deepfake detection that leverages knowledge representations and voice profile matching. Within this framework, we propose simple yet effective retrieval and ensemble methods that reach performance comparable to supervised baselines and their fine-tuned counterparts on the DeepFake-Eval-2024 benchmark, without any additional model training. We also conduct an ablation on voice profile attributes, and demonstrate the cross-database generalizability of the framework by introducing simple, training-free fusion strategies.

Paper Structure

This paper contains 16 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Demonstration of the vulnerability of deepfake detection models caused by the time lag between the zero-day attack period and model update. The time progression indicates natural time flow.
  • Figure 2: Outline of the methodology. The knowledge source data contains input recordings and ground truth binary labels. The lock symbols indicate that both the CM and profile feature extractors remain frozen without fine-tuning. Retrieved vectors can be CM features, profile features, or both, depending on the strategy detailed in the k-NN section. The number of retrieved utterances in this outline is $k=4$, as an example.
  • Figure 3: Selective fusion of the proposed training-free deepfake detector and SSL-based CM. The lock symbols indicate the models that remain frozen without fine-tuning. The text in italics refers to the source data used to construct the knowledge databases. We use B00 and S08 as example systems here.
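
The retrieval step outlined in Figure 2 can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `knn_majority_vote`, the use of cosine similarity, the label convention (1 = fake, 0 = real), and the reading of "MV" as a majority vote over retrieved labels are all assumptions made for illustration. The feature vectors stand in for whatever the frozen CM or profile extractors would produce.

```python
import numpy as np


def knn_majority_vote(query, db_feats, db_labels, k=4):
    """Hypothetical helper: label a query from its k nearest database entries.

    query:     (d,) feature vector of the test utterance
    db_feats:  (n, d) feature vectors stored in the knowledge database
    db_labels: (n,) binary labels, assumed 1 = fake and 0 = real
    Returns (predicted_label, fraction_of_retrieved_entries_labeled_fake).
    """
    query = np.asarray(query, dtype=float)
    db_feats = np.asarray(db_feats, dtype=float)
    db_labels = np.asarray(db_labels)

    # Cosine similarity (an assumed metric) between the query and each entry.
    db_unit = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    q_unit = query / np.linalg.norm(query)
    sims = db_unit @ q_unit

    # Indices of the k most similar database entries.
    top_k = np.argsort(-sims)[:k]

    # Fraction of retrieved entries labeled fake.
    fake_ratio = float(np.mean(db_labels[top_k]))

    # Strict majority vote; a tie falls back to "real" (0) in this sketch.
    return int(fake_ratio > 0.5), fake_ratio


if __name__ == "__main__":
    # Toy database: three "fake" vectors near the x-axis, three "real" near the y-axis.
    feats = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
    labels = [1, 1, 1, 0, 0, 0]
    print(knn_majority_vote([1.0, 0.05], feats, labels, k=3))
```

No model parameters are updated anywhere in this sketch; adapting to a new attack would only require appending labeled feature vectors to the database, which is the property the training-free framing relies on.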