Diffuse or Confuse: A Diffusion Deepfake Speech Dataset
Anton Firc, Kamil Malinka, Petr Hanáček
TL;DR
This work evaluates diffusion-based speech synthesis as a new avenue for deepfake creation and releases a diffusion-generated deepfake speech dataset derived from LJSpeech. It compares how well multiple detectors trained on ASVspoof 2019 data recognize diffusion and non-diffusion deepfakes, finding that detection performance is broadly similar for the two, with some model-dependent variability. The study also investigates re-vocoding non-diffusion deepfakes with diffusion vocoders and assesses speech quality using standard metrics (WER, PESQ, SNR, speaker similarity), noting that diffusion introduces more noise while maintaining comparable overall quality. The dataset and initial findings support future security research, suggesting detector ensembles and diffusion-noise cues as promising directions for robust deepfake detection.
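To make two of the quality metrics named above concrete, the following is a minimal sketch of word error rate (WER) and a reference-based signal-to-noise ratio (SNR). It is illustrative only and is not the paper's evaluation code: the function names `word_error_rate` and `snr_db` are hypothetical, and the assumption that SNR is computed against an aligned clean reference signal is ours, since the text does not say how SNR was estimated.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def snr_db(clean, noisy) -> float:
    """SNR in dB, assuming `noisy` is `clean` plus additive noise (our assumption)."""
    if len(clean) != len(noisy):
        raise ValueError("signals must have the same length")
    signal_power = sum(s * s for s in clean)
    noise_power = sum((n - s) ** 2 for s, n in zip(clean, noisy))
    if signal_power == 0:
        raise ValueError("clean signal has zero power")
    if noise_power == 0:
        return math.inf
    return 10 * math.log10(signal_power / noise_power)
```

In practice the hypothesis transcript would come from an automatic speech recognizer run on the synthesized audio, so a lower WER indicates more intelligible speech, while a lower SNR indicates a noisier signal.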
Abstract
Advancements in artificial intelligence and machine learning have significantly improved synthetic speech generation. This paper explores diffusion models, a novel method for creating realistic synthetic speech. We create a dataset of diffusion-generated deepfake speech using available tools and pretrained models. Additionally, this study assesses the quality of diffusion-generated deepfakes versus non-diffusion ones and their potential threat to current deepfake detection systems. Findings indicate that the detection of diffusion-based deepfakes is generally comparable to that of non-diffusion deepfakes, with some variability based on detector architecture. Re-vocoding with diffusion vocoders shows minimal impact, and the overall speech quality is comparable to that of non-diffusion methods.
