Synthetic Audio Forensics Evaluation (SAFE) Challenge

Kirill Trapeznikov, Paul Cummer, Pranay Pherwani, Jai Aslam, Michael S. Davinroy, Peter Bautista, Laura Cassani, Matthew Stamm, Jill Crisman

TL;DR

SAFE addresses the rising challenge of authenticating audio in the era of highly realistic synthetic speech by introducing a fully blind, three-task benchmark that probes detection robustness to post-processing and laundering. The approach leverages a source-balanced corpus with 21 real sources and 17 TTS models, totaling roughly 90 hours, and uses a Hugging Face-driven evaluation framework with public/private splits. Results from Round 1 show strong performance on raw synthetic speech but marked degradation under realistic processing and laundering, highlighting gaps in generalization and resilience. The framework lays a scalable foundation for ongoing advancement in audio forensics, offering a practical pathway to improve detectors against adversarial manipulation and unseen sources.

Abstract

The increasing realism of synthetic speech generated by advanced text-to-speech (TTS) models, coupled with post-processing and laundering techniques, presents a significant challenge for audio forensic detection. In this paper, we introduce the SAFE (Synthetic Audio Forensics Evaluation) Challenge, a fully blind evaluation framework designed to benchmark detection models across progressively harder scenarios: raw synthetic speech, processed audio (e.g., compression, resampling), and laundered audio intended to evade forensic analysis. The SAFE Challenge comprised a total of 90 hours of audio and 21,000 audio samples, drawn from 21 real sources and 17 TTS models and split across 3 tasks. We present the challenge, the evaluation design and tasks, dataset details, and initial insights into the strengths and limitations of current approaches, offering a foundation for advancing synthetic audio detection research. More information is available at https://stresearch.github.io/SAFE/.

Paper Structure

This paper contains 17 sections, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Synthetic Audio Forensics Evaluation Challenge Round 1 Results. Performance (circle markers) of the top five teams and the detection vs. false alarm curves are shown for three tasks of increasing difficulty: detection of (1) synthetic voice audio, (2) synthetic voice audio post-processed with various compression and resampling, and (3) synthetic voice audio laundered to evade detection.
  • Figure 2: Task 1 TNR results conditioned on real source in Round 2.
  • Figure 3: Task 1 TPR results conditioned on generated source in Round 2.
  • Figure 4: Task 2 balanced accuracy conditioned on augmentation in Round 2.
  • Figure 5: Task 3 balanced accuracy conditioned on laundering technique in Round 2.
  • ...and 2 more figures
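
The figure captions above report TPR, TNR, and balanced accuracy conditioned on a source or processing type. As a minimal sketch of how such quantities are commonly computed (not the challenge's actual scoring code), the snippet below assumes binary labels with 1 = synthetic and 0 = real, hard 0/1 predictions, and a per-sample group tag. The function names `rates_by_group` and `balanced_accuracy` and the group names in the usage note are hypothetical.

```python
from collections import defaultdict


def rates_by_group(labels, preds, groups):
    """Per-group TPR and TNR. Assumed convention: 1 = synthetic, 0 = real.

    A rate is None when the group has no samples of the relevant class,
    e.g. a real source has no synthetic samples, so its TPR is undefined.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for y, p, g in zip(labels, preds, groups):
        if y == 1:
            counts[g]["tp" if p == 1 else "fn"] += 1
        else:
            counts[g]["tn" if p == 0 else "fp"] += 1

    result = {}
    for g, c in counts.items():
        n_pos = c["tp"] + c["fn"]
        n_neg = c["tn"] + c["fp"]
        result[g] = {
            "tpr": c["tp"] / n_pos if n_pos else None,
            "tnr": c["tn"] / n_neg if n_neg else None,
        }
    return result


def balanced_accuracy(labels, preds):
    """Balanced accuracy = (TPR + TNR) / 2 over all samples."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("balanced accuracy needs both real and synthetic samples")
    return 0.5 * (tp / n_pos + tn / n_neg)
```

For example, with `labels = [1, 1, 1, 1, 0, 0]`, `preds = [1, 1, 1, 0, 0, 1]`, and `groups = ["tts_a", "tts_a", "tts_b", "tts_b", "real_x", "real_x"]`, the generators get a TPR each, the real source gets a TNR, and the balanced accuracy averages the overall TPR and TNR so that an unequal number of real and synthetic samples does not dominate the score.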