Table of Contents
Fetching ...

FakeSound: Deepfake General Audio Detection

Zeyu Xie, Baihan Li, Xuenan Xu, Zheng Liang, Kai Yu, Mengyue Wu

TL;DR

This work targets the rising risk of deepfake general audio by introducing FakeSound, a dataset generated through an automated grounding–masking–regeneration pipeline to produce realistic manipulated audio and three test sets with increasing difficulty. It also proposes a benchmark detector that leverages a general-audio pre-trained backbone with a frame-level representation to jointly identify manipulated content and localize fake regions, evaluated at 1-second and 20-millisecond resolutions. The proposed model consistently outperforms a speech-focused state-of-the-art baseline and surpasses human performance on most tasks, though zero-shot domain adaptation remains challenging. Overall, the study advances general-audio deepfake detection and localization, with practical implications for safeguarding against audio-based misinformation and scams.

Abstract

With the advancement of audio generation, generative models can produce highly realistic audios. However, the proliferation of deepfake general audio can pose negative consequences. Therefore, we propose a new task, deepfake general audio detection, which aims to identify whether audio content is manipulated and to locate deepfake regions. Leveraging an automated manipulation pipeline, a dataset named FakeSound for deepfake general audio detection is proposed, and samples can be viewed on website https://FakeSoundData.github.io. The average binary accuracy of humans on all test sets is consistently below 0.6, which indicates the difficulty humans face in discerning deepfake audio and affirms the efficacy of the FakeSound dataset. A deepfake detection model utilizing a general audio pre-trained model is proposed as a benchmark system. Experimental results demonstrate that the performance of the proposed model surpasses the state-of-the-art in deepfake speech detection and human testers.

FakeSound: Deepfake General Audio Detection

TL;DR

This work targets the rising risk of deepfake general audio by introducing FakeSound, a dataset generated through an automated grounding–masking–regeneration pipeline to produce realistic manipulated audio and three test sets with increasing difficulty. It also proposes a benchmark detector that leverages a general-audio pre-trained backbone with a frame-level representation to jointly identify manipulated content and localize fake regions, evaluated at 1-second and 20-millisecond resolutions. The proposed model consistently outperforms a speech-focused state-of-the-art baseline and surpasses human performance on most tasks, though zero-shot domain adaptation remains challenging. Overall, the study advances general-audio deepfake detection and localization, with practical implications for safeguarding against audio-based misinformation and scams.

Abstract

With the advancement of audio generation, generative models can produce highly realistic audios. However, the proliferation of deepfake general audio can pose negative consequences. Therefore, we propose a new task, deepfake general audio detection, which aims to identify whether audio content is manipulated and to locate deepfake regions. Leveraging an automated manipulation pipeline, a dataset named FakeSound for deepfake general audio detection is proposed, and samples can be viewed on website https://FakeSoundData.github.io. The average binary accuracy of humans on all test sets is consistently below 0.6, which indicates the difficulty humans face in discerning deepfake audio and affirms the efficacy of the FakeSound dataset. A deepfake detection model utilizing a general audio pre-trained model is proposed as a benchmark system. Experimental results demonstrate that the performance of the proposed model surpasses the state-of-the-art in deepfake speech detection and human testers.
Paper Structure (15 sections, 3 equations, 2 figures, 2 tables)

This paper contains 15 sections, 3 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Samples of FakeSound, synthesized by the manipulation pipeline. A grounding model locates and masks key regions in the genuine audio, followed by regeneration and replacement by the generation model.
  • Figure 2: Left: Manipulation pipeline. A grounding model locates and masks key regions on genuine audio based on caption information. The generation model regenerates these key regions, replacing them to produce convincing realistic deepfake general audio. Right: Diagram of proposed model, which conducts deepfake detection on input general audio— identifies whether the audio is genuine or deepfake, and locates the deepfake regions.