Synthetic Speech Classification: IEEE Signal Processing Cup 2022 challenge
Mahieyin Rahmun, Rafat Hasan Khan, Tanjim Taharat Aurpa, Sadia Khan, Zulker Nayeen Nahiyan, Mir Sayad Bin Almas, Rakibul Hasan Rajib, Syeda Sakira Hassan
TL;DR
This paper tackles synthetic speech attribution within the SPCUP2022 challenge, aiming to identify which synthesis algorithm produced a given audio sample. It systematically compares classical ML (SVM, GMM) and deep learning approaches, including baseline TSSDNet variants and convolutional networks trained on raw waveform and spectral features. The study demonstrates that end-to-end deep networks operating on raw audio, particularly Inc-TSSDNet with data augmentation, yield the strongest classification performance, supported by feature analysis (t-SNE) showing superior class separation. The findings underscore the value of raw-waveform end-to-end models and augmented data for robust synthetic speech attribution, with practical implications for detector design and defense against audio spoofing.
Abstract
The aim of this project is to design and implement a robust synthetic speech classifier for the IEEE Signal Processing Cup 2022 challenge. Here, we learn a synthetic speech attribution model using speech generated by various text-to-speech (TTS) algorithms as well as unknown TTS algorithms. We experiment with both classical machine learning methods, such as support vector machines and Gaussian mixture models, and deep learning based methods, such as ResNet, VGG16, and two shallow end-to-end networks. We observe that deep learning based methods with raw data demonstrate the best performance.
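
To make the classical, likelihood-based side of the comparison concrete, the following minimal sketch shows how attribution can work when one generative model is fitted per synthesis algorithm and a sample is assigned to the model that scores it highest. It is an illustration under stated assumptions, not the authors' implementation: each algorithm is modeled by a single diagonal Gaussian (a one-component simplification of a Gaussian mixture model), the feature vectors are synthetic toy data rather than features extracted from speech, and the names `fit_diag_gaussian`, `log_likelihood`, `attribute`, and the labels `algo_0` to `algo_2` are hypothetical.

```python
import math
import random


def fit_diag_gaussian(vectors):
    """Estimate per-dimension mean and variance from a list of feature vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dim)]
    variances = [
        # Small floor keeps the variance strictly positive.
        sum((v[d] - means[d]) ** 2 for v in vectors) / n + 1e-6
        for d in range(dim)
    ]
    return means, variances


def log_likelihood(x, model):
    """Log-density of x under a diagonal Gaussian given as (means, variances)."""
    means, variances = model
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (xi - mu) ** 2 / var)
        for xi, mu, var in zip(x, means, variances)
    )


def attribute(x, models):
    """Return the label whose model assigns x the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(x, models[label]))


# Toy stand-in for per-utterance features (assumption: 3-dimensional vectors,
# one well-separated cluster per hypothetical synthesis algorithm).
rng = random.Random(0)


def make_cluster(center, n=200):
    return [[rng.gauss(c, 1.0) for c in center] for _ in range(n)]


train = {
    "algo_0": make_cluster([0.0, 0.0, 0.0]),
    "algo_1": make_cluster([5.0, 5.0, 5.0]),
    "algo_2": make_cluster([-5.0, 5.0, 0.0]),
}

# One model per class; attribution picks the best-scoring class.
models = {label: fit_diag_gaussian(vecs) for label, vecs in train.items()}
predicted = attribute([4.8, 5.1, 4.9], models)
print(predicted)
```

In practice the toy vectors would be replaced by spectral features computed from audio, and each class model would typically use several mixture components. The decision rule, choosing the class with the maximum log-likelihood, stays the same.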
