Synthetic Audio Forensics Evaluation (SAFE) Challenge
Kirill Trapeznikov, Paul Cummer, Pranay Pherwani, Jai Aslam, Michael S. Davinroy, Peter Bautista, Laura Cassani, Matthew Stamm, Jill Crisman
TL;DR
SAFE addresses the rising challenge of authenticating audio in the era of highly realistic synthetic speech by introducing a fully blind, three-task benchmark that probes detection robustness to post-processing and laundering. The approach leverages a source-balanced corpus with 21 real sources and 17 TTS models, totaling roughly 90 hours, and uses a Hugging Face-driven evaluation framework with public/private splits. Results from Round 1 show strong performance on raw synthetic speech but marked degradation under realistic processing and laundering, highlighting gaps in generalization and resilience. The framework lays a scalable foundation for ongoing advancement in audio forensics, offering a practical pathway to improve detectors against adversarial manipulation and unseen sources.
Abstract
The increasing realism of synthetic speech generated by advanced text-to-speech (TTS) models, coupled with post-processing and laundering techniques, presents a significant challenge for audio forensic detection. In this paper, we introduce the SAFE (Synthetic Audio Forensics Evaluation) Challenge, a fully blind evaluation framework designed to benchmark detection models across progressively harder scenarios: raw synthetic speech, processed audio (e.g., compression, resampling), and laundered audio intended to evade forensic analysis. The SAFE Challenge comprised 90 hours of audio in 21,000 samples, drawn from 21 real sources and 17 TTS models and organized into 3 tasks. We present the challenge, evaluation design and tasks, dataset details, and initial insights into the strengths and limitations of current approaches, offering a foundation for advancing synthetic audio detection research. More information is available at https://stresearch.github.io/SAFE/.
