ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts

Ashi Garg, Zexin Cai, Lin Zhang, Henry Li Xinyuan, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews

TL;DR

ShiftySpeech tackles the problem of synthetic speech detector robustness under distribution shifts by introducing a large-scale, controllable benchmark (over 3,000 hours across 7 source domains, 6 TTS systems, 12 vocoders, and 3 languages). It systematically analyzes generalization of state-of-the-art detectors using SSL-based features under diverse shifts, including vocoder choice, TTS, and cross-language scenarios. The results show substantial degradation in detection performance under tested distribution shifts, highlighting the relevance of shift-aware benchmarking and the need for robust, adaptable detectors. The work provides a practical framework and dataset to guide future development of detectors that remain effective in realistic, evolving deployment conditions.

Abstract

The problem of synthetic speech detection has enjoyed considerable attention, with recent methods achieving low error rates across several established benchmarks. However, to what extent can low error rates on academic benchmarks translate to more realistic conditions? In practice, while the training set is fixed at one point in time, test-time conditions may exhibit distribution shifts relative to the training conditions, such as changes in speaker characteristics, emotional expressiveness, language and acoustic conditions, and the emergence of novel synthesis methods. Although some existing datasets target subsets of these distribution shifts, systematic analysis remains difficult due to inconsistencies between source data and synthesis systems across datasets. This difficulty is further exacerbated by the rapid development of new text-to-speech (TTS) and vocoder systems, which continually expand the diversity of synthetic speech. To enable systematic benchmarking of model performance under distribution shifts, we introduce ShiftySpeech, a large-scale benchmark comprising over 3,000 hours of synthetic speech across 7 source domains, 6 TTS systems, 12 vocoders, and 3 languages. ShiftySpeech is specifically designed to evaluate model generalization under controlled distribution shifts while ensuring broad coverage of modern synthetic speech generation techniques. It fills a key gap in current benchmarks by supporting fine-grained, controlled analysis of generalization robustness. All tested distribution shifts significantly degrade detection performance of state-of-the-art detection approaches based on self-supervised features. Overall, our findings suggest that reliance on synthetic speech detection methods in production environments should be carefully evaluated based on anticipated distribution shifts.

Paper Structure

This paper contains 43 sections, 4 figures, 45 tables.

Figures (4)

  • Figure 1: Illustration of distribution shift in ShiftySpeech. Train–test mismatches can arise from differences in language, speaker, or recording conditions.
  • Figure 2: ShiftySpeech data generation pipeline illustrating three synthetic speech generation processes: text-to-speech (TTS), voice conversion (VC), and re-vocoding.
  • Figure 3: Average EER (aEER) of models trained on HFG-generated audio samples (LibriTTS, train-clean-360) with an increasing number of speakers. For each dataset, the average EER over all 12 test vocoders is plotted. Test samples are also generated using the HFG vocoder. aEER drops significantly when the number of speakers is increased to four. In most cases, no significant performance gain was observed thereafter.
  • Figure 4: Average EER (aEER) of models trained on HFG-generated audio samples (LibriTTS, train-clean-360) with an increasing number of speakers. For each dataset, the average EER over all 12 test vocoders is plotted. Test samples are generated using the PWG vocoder. aEER drops significantly when the number of speakers is increased to four. In most cases, no significant performance gain was observed thereafter.
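
The figure captions report equal error rate (EER) and its average over test conditions (aEER). As a rough illustration of what these metrics measure, the sketch below computes an EER from detector scores and averages it over several test conditions. This is not the paper's evaluation code: the function names `compute_eer` and `average_eer`, the convention that a higher score means "more likely bona fide", and the simple threshold sweep are all assumptions made for this example.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the equal error rate with a threshold sweep.

    Assumption (not from the paper): higher score = more likely bona fide.
    """
    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection rate: bona fide samples scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: synthetic samples scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(frr - far)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (frr + far) / 2
    return eer


def average_eer(scores_by_condition):
    """Average the EER over test conditions (for example, one per vocoder).

    `scores_by_condition` maps a hypothetical condition name to a
    (bonafide_scores, spoof_scores) pair.
    """
    eers = [compute_eer(b, s) for b, s in scores_by_condition.values()]
    return sum(eers) / len(eers)


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    toy = {
        "condition_a": ([0.9, 0.8], [0.1, 0.2]),
        "condition_b": ([0.9, 0.4], [0.6, 0.1]),
    }
    print(average_eer(toy))
```

Averaging per-condition EERs, rather than pooling all scores into a single EER, gives each test condition equal weight regardless of how many samples it contains. Whether the paper averages in exactly this way is not stated in the captions.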