
Listening Between the Lines: Synthetic Speech Detection Disregarding Verbal Content

Davide Salvi, Temesgen Semu Balcha, Paolo Bestagini, Stefano Tubaro

TL;DR

The paper investigates whether synthetic speech detection can be achieved by analyzing background noise rather than spoken content. By decomposing an input signal $\mathbf{x}$ into verbal $\mathbf{s}$ and background $\mathbf{n}$ components via two noise extractors, and training three detectors $\mathcal{D}_{\mathbf{x}}, \mathcal{D}_{\mathbf{s}}, \mathcal{D}_{\mathbf{n}}$ on $\mathbf{x}$, $\mathbf{s}$, and $\mathbf{n}$ respectively, the study finds that the background-only detector $\mathcal{D}_{\mathbf{n}}$ often yields superior detection performance. Across multiple datasets (ASVspoof 2019/2021, AISEC In-the-Wild, FakeOrReal) and under MP3 anti-forensics, $\mathcal{D}_{\mathbf{n}}$ shows robust generalization, while $\mathcal{D}_{\mathbf{x}}$ and $\mathcal{D}_{\mathbf{s}}$ are more variable and sometimes rely on verbal content. These results highlight the importance of high-frequency background artifacts in forensic detection and call for detectors that robustly handle post-processing and cross-dataset variation, with implications for interpretability and detector design.
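The decomposition described above can be illustrated with a minimal sketch. It is not the paper's implementation: the extractor below is a hypothetical moving-average smoother standing in for the learned noise extractors, and treating the background as the residual $\mathbf{n} = \mathbf{x} - \mathbf{s}$ is an assumption made here for illustration only. The names `toy_speech_extractor`, `decompose`, and `detector_inputs` are likewise invented for this example.

```python
import numpy as np


def toy_speech_extractor(x, kernel=9):
    """Hypothetical stand-in for a noise extractor.

    The extractors in the paper are learned models; this moving-average
    smoother only plays the role of "something that returns an estimate
    of the verbal component s".
    """
    window = np.ones(kernel) / kernel
    return np.convolve(x, window, mode="same")


def decompose(x):
    """Split x into an estimated verbal part s and a background part n."""
    s = toy_speech_extractor(x)
    n = x - s  # assumption: background is taken as the residual
    return s, n


# Toy input: one second of a 220 Hz tone plus weak noise at 16 kHz.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
x = np.sin(2 * np.pi * 220.0 * t) + 0.05 * rng.standard_normal(t.size)

s, n = decompose(x)

# Each detector would be trained and evaluated on its own view of the signal.
detector_inputs = {"D_x": x, "D_s": s, "D_n": n}
```

In this sketch the three detectors differ only in which signal they receive, which mirrors the comparison between $\mathcal{D}_{\mathbf{x}}$, $\mathcal{D}_{\mathbf{s}}$, and $\mathcal{D}_{\mathbf{n}}$.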

Abstract

Recent advancements in synthetic speech generation have led to the creation of forged audio data that are almost indistinguishable from real speech. This phenomenon poses a new challenge for the multimedia forensics community, as the misuse of synthetic media can potentially cause adverse consequences. Several methods have been proposed in the literature to mitigate potential risks and detect synthetic speech, mainly focusing on the analysis of the speech itself. However, recent studies have revealed that the most crucial frequency bands for detection lie in the highest ranges (above 6000 Hz), which do not include any speech content. In this work, we extensively explore this aspect and investigate whether synthetic speech detection can be performed by focusing only on the background component of the signal while disregarding its verbal content. Our findings indicate that the speech component is not the predominant factor in performing synthetic speech detection. These insights provide valuable guidance for the development of new synthetic speech detectors and their interpretability, together with some considerations on the existing work in the audio forensics field.

Paper Structure

This paper contains 9 sections, 3 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Pipeline of the proposed method.
  • Figure 2: Spectrograms of the signals $\mathbf{x}$ (left), $\mathbf{s}$ (center), and $\mathbf{n}$ (right) of an example track. The two tracks $\mathbf{s}$ and $\mathbf{n}$ were computed using the $\mathcal{S}_\text{DMCS}$ model.
  • Figure 3: ROC curves showing the synthetic speech detection performance of $\mathcal{D}_\mathbf{x}$, $\mathcal{D}_\mathbf{s}$, and $\mathcal{D}_\mathbf{n}$ tested on the ASVspoof 2019 dataset, with $\mathcal{S}_\text{DMCS}$ as the noise extractor.
  • Figure 4: Balanced accuracy values scored by $\mathcal{D}_\mathbf{x}$, $\mathcal{D}_\mathbf{s}$, and $\mathcal{D}_\mathbf{n}$ tested on the ASVspoof 2019 dataset under varying MP3 compression bitrates.
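
Figure 4 reports balanced accuracy, which is the mean of the true positive rate and the true negative rate and is therefore less sensitive to class imbalance than plain accuracy. The sketch below shows the standard definition on made-up labels. The label convention (1 for synthetic, 0 for real) and the example predictions are assumptions for illustration, not values from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls for binary labels in {0, 1}."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


# Made-up example: 4 synthetic tracks (label 1) and 2 real tracks (label 0).
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

score = balanced_accuracy(y_true, y_pred)  # 0.5 * (3/4 + 1/2) = 0.625
plain = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)  # 4/6
```

Here plain accuracy (about 0.667) is higher than balanced accuracy (0.625) because the majority class is classified better than the minority class.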