SceneFake: An Initial Dataset and Benchmarks for Scene Fake Audio Detection

Jiangyan Yi, Chenglong Wang, Jianhua Tao, Chu Yuan Zhang, Cunhang Fan, Zhengkun Tian, Haoxin Ma, Ruibo Fu

TL;DR

SceneFake addresses a previously underexplored threat in audio forensics: scene-manipulated speech, created by removing the original acoustic scene with speech-enhancement techniques and overlaying a forged one. The authors construct the dataset by combining ASVspoof 2019 LA data with DCASE acoustic scenes, define training, development, seen-test, and unseen-test splits, and evaluate multiple speech-enhancement (SE) methods and baseline detectors using the equal error rate (EER) and the tandem detection cost function (t-DCF). They report that state-of-the-art LA spoof detectors struggle to generalize to scene forgery, especially under unseen conditions, and show how the choice of SE model affects detection performance. The publicly available dataset and benchmarks aim to spur robust, generalizable scene-forgery detection and highlight the need for more realistic, diverse, and language-agnostic evaluation. Future work includes richer manipulations and interpretability analyses.

Abstract

Many datasets have been designed to further the development of fake audio detection. However, fake utterances in previous datasets are mostly generated by altering the timbre, prosody, linguistic content, or channel noise of the original audio. These datasets leave out a scenario in which the acoustic scene of the original audio is replaced with a forged one. Such manipulated audio would pose a major threat to society if misused for malicious purposes, which motivates us to fill this gap. This paper proposes a dataset for scene fake audio detection named SceneFake, in which a manipulated utterance is generated by tampering only with the acoustic scene of a real utterance using speech enhancement technologies. Benchmark results for scene fake audio detection on the SceneFake dataset are reported. In addition, an analysis of fake attacks with different speech enhancement technologies and signal-to-noise ratios is presented. The results indicate that scene fake utterances cannot be reliably detected by baseline models trained on the ASVspoof 2019 dataset. Although these models perform well on the SceneFake training set and seen test set, their performance is poor on the unseen test set. The dataset (https://zenodo.org/record/7663324#.Y_XKMuPYuUk) and benchmark source code (https://github.com/ADDchallenge/SceneFake) are publicly available.
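
The TL;DR names EER as one of the evaluation metrics. As an illustration only, here is a minimal sketch of how an EER can be estimated from detector scores with a plain threshold sweep. It assumes that a higher score means "more likely real"; the function name `compute_eer` and the score convention are assumptions for this sketch, not taken from the SceneFake benchmark code, which may compute the metric differently.

```python
import numpy as np


def compute_eer(real_scores, fake_scores):
    """Estimate the equal error rate by sweeping thresholds over observed scores.

    Assumes higher scores indicate real (bona fide) audio.
    Returns (eer, threshold) at the point where the two error rates are closest.
    """
    real = np.asarray(real_scores, dtype=np.float64)
    fake = np.asarray(fake_scores, dtype=np.float64)

    # Candidate thresholds: every distinct score that was observed.
    thresholds = np.unique(np.concatenate([real, fake]))

    # False rejection: a real utterance scored below the threshold.
    frr = np.array([(real < t).mean() for t in thresholds])
    # False acceptance: a fake utterance scored at or above the threshold.
    far = np.array([(fake >= t).mean() for t in thresholds])

    # The EER lies where the two error curves cross; take the closest point.
    idx = int(np.argmin(np.abs(frr - far)))
    eer = (frr[idx] + far[idx]) / 2.0
    return float(eer), float(thresholds[idx])
```

With few trials the crossing point is only approximate, because the sweep visits observed scores and does not interpolate between them.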

Paper Structure (20 sections, 3 equations, 6 figures, 15 tables)

Figures (6)

  • Figure 1: Spectrogram of an example utterance, "What do we want to do that for?". The signal carries rich information, including timbre, prosody, linguistic content, channel noise, and the acoustic scene. The acoustic scene of this utterance is "Airport".
  • Figure 2: Waveforms of example utterances. (a) A real utterance recorded in a scene, such as "Airport". (b) A fake utterance, in which the scene of the real utterance is replaced with another scene, such as "Public square".
  • Figure 3: An example of acoustic scene manipulation for a fake utterance. The manipulation procedure consists of two steps: 1. Enhancing the real speech, which contains a scene such as "Airport". 2. Adding another scene, such as "Street", to the enhanced speech. The signal-to-noise ratio (SNR) of the real utterance is denoted by SNR$_{real}$, and the SNR of the fake utterance by SNR$_{fake}$. In this example, SNR$_{real}$ and SNR$_{fake}$ are both 5 dB.
  • Figure 4: Data structure of the SceneFake dataset. It consists of five sets: training, development, seen test, unseen test 1, and unseen test 2.
  • Figure 5: Spectrogram examples of utterances at -5 dB and 20 dB in the seen test set. "Clean" denotes the clean utterance. "Real" denotes the real utterance simulated by adding the acoustic scene "Airport" to the clean utterance. "SSub", "MMSE", "Wiener", and "FullSubNet" denote the utterance enhanced with the corresponding speech enhancement model. "Fake" denotes the corresponding fake utterance generated by adding the scene "Public square" to the enhanced speech.
  • ...and 1 more figures
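
The two-step manipulation described in the Figure 3 caption (enhance the real utterance, then add a new scene at a chosen SNR) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, `enhance` is a placeholder for any speech enhancement model, and the SNR is defined here as the ratio of mean speech power to mean scene power, which may differ from the exact definition used to build the dataset.

```python
import numpy as np


def mix_at_snr(speech, scene, snr_db):
    """Add `scene` to `speech`, scaled so the speech-to-scene power ratio is `snr_db`."""
    speech = np.asarray(speech, dtype=np.float64)
    scene = np.asarray(scene, dtype=np.float64)

    # Loop the scene recording if it is shorter than the speech, then trim it.
    reps = int(np.ceil(len(speech) / len(scene)))
    scene = np.tile(scene, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    scene_power = np.mean(scene ** 2)

    # Solve 10 * log10(speech_power / (gain**2 * scene_power)) = snr_db for gain.
    gain = np.sqrt(speech_power / (scene_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * scene


def measured_snr_db(speech, mixture):
    """Recover the SNR of a mixture when the underlying speech is known."""
    speech = np.asarray(speech, dtype=np.float64)
    residual = np.asarray(mixture, dtype=np.float64) - speech
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(residual ** 2))


def make_scene_fake(real_utterance, new_scene, snr_fake_db, enhance):
    """Step 1: suppress the original scene. Step 2: overlay a different scene.

    `enhance` is a hypothetical callable standing in for a speech
    enhancement model; it takes and returns a 1-D waveform.
    """
    enhanced = enhance(real_utterance)
    return mix_at_snr(enhanced, new_scene, snr_fake_db)
```

For example, `make_scene_fake(real, street_noise, 5.0, enhance=my_model)` would mirror the 5 dB setting in the caption, where `real`, `street_noise`, and `my_model` are whatever waveforms and enhancement function the reader supplies.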