Synthetic Speech Classification: IEEE Signal Processing Cup 2022 challenge

Mahieyin Rahmun, Rafat Hasan Khan, Tanjim Taharat Aurpa, Sadia Khan, Zulker Nayeen Nahiyan, Mir Sayad Bin Almas, Rakibul Hasan Rajib, Syeda Sakira Hassan

TL;DR

This paper tackles synthetic speech attribution within the SPCUP2022 challenge, aiming to identify which synthesis algorithm produced a given audio sample. It systematically compares classical ML (SVM, GMM) and deep learning approaches, including baseline TSSDNet variants and convolutional networks trained on raw waveform and spectral features. The study demonstrates that end-to-end deep networks operating on raw audio, particularly Inc-TSSDNet with data augmentation, yield the strongest classification performance, supported by feature analysis (t-SNE) showing superior class separation. The findings underscore the value of raw-waveform end-to-end models and augmented data for robust synthetic speech attribution, with practical implications for detector design and defense against audio spoofing.

Abstract

The aim of this project is to design and implement a robust synthetic speech classifier for the IEEE Signal Processing Cup 2022 challenge. Here, we learn a synthetic speech attribution model using speech generated by various text-to-speech (TTS) algorithms as well as unknown TTS algorithms. We experiment with both classical machine learning methods, such as support vector machines and Gaussian mixture models, and deep learning based methods, such as ResNet, VGG16, and two shallow end-to-end networks. We observe that deep learning based methods with raw data demonstrate the best performance.

Paper Structure

This paper contains 21 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The Inception-style TSSDNet architecture (adapted from hua2021towards).
  • Figure 2: t-SNE plots of the feature embeddings for (a) Res-TSSDNet, no augmentations, (b) Inc-TSSDNet, no augmentations, (c) MFCC, with augmentations, (d) VGG16, with augmentations, (e) Res-TSSDNet, with augmentations, and (f) Inc-TSSDNet, with augmentations. The colors blue, yellow, light blue, green, light green, and pink represent class labels $\{0, 1, 2, 3, 4, 5\}$, respectively.
  • Figure 3: Confusion matrices for (a) Res-TSSDNet, no augmented data, (b) Res-TSSDNet, augmented data, (c) Inc-TSSDNet, no augmented data, (d) Inc-TSSDNet, augmented data, (e) ResNet34, augmented data, and (f) SVM, augmented data.
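
For readers unfamiliar with the confusion matrices listed under Figure 3, the following is a minimal sketch of how such a matrix can be tallied for the six class labels $\{0, 1, 2, 3, 4, 5\}$. The function names and the label lists are hypothetical and chosen only for illustration; they are not the authors' code and not results from the paper.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count predictions; rows index the true class, columns the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def per_class_recall(matrix: List[List[int]]) -> List[float]:
    """Fraction of each true class that was predicted correctly (diagonal / row sum)."""
    recalls = []
    for class_index, row in enumerate(matrix):
        row_total = sum(row)
        recalls.append(row[class_index] / row_total if row_total else 0.0)
    return recalls


# Hypothetical labels for illustration only (two samples per class).
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
y_pred = [0, 0, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5]

cm = confusion_matrix(y_true, y_pred, num_classes=6)
recall = per_class_recall(cm)

for row in cm:
    print(row)
print(recall)
```

Off-diagonal entries show which classes are confused with one another, which is the information the panels of Figure 3 compare across models and augmentation settings.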