Diffuse or Confuse: A Diffusion Deepfake Speech Dataset

Anton Firc; Kamil Malinka; Petr Hanáček

TL;DR

This work evaluates diffusion-based speech synthesis as a new avenue for deepfake creation and releases a diffusion-generated deepfake speech dataset derived from LJSpeech. It compares diffusion and non-diffusion synthesis across multiple detectors trained on ASVspoof 2019 data and finds that detection performance is broadly similar for diffusion and non-diffusion deepfakes, with some model-dependent variability. The study also investigates re-vocoding non-diffusion deepfakes with diffusion vocoders and assesses standard speech-quality metrics (WER, PESQ, SNR, speaker similarity), noting that diffusion introduces more noise while maintaining comparable overall quality. The dataset and initial findings support future security research and point to detector ensembles and diffusion-noise cues as promising directions for robust deepfake detection.

Abstract

Advancements in artificial intelligence and machine learning have significantly improved synthetic speech generation. This paper explores diffusion models, a novel method for creating realistic synthetic speech. We create a diffusion dataset using available tools and pretrained models. Additionally, this study assesses the quality of diffusion-generated deepfakes versus non-diffusion ones and the threat they pose to current deepfake detection systems. Findings indicate that the detection of diffusion-based deepfakes is generally comparable to that of non-diffusion deepfakes, with some variability depending on detector architecture. Re-vocoding with diffusion vocoders has minimal impact, and the overall speech quality is comparable to that of non-diffusion methods.
