Cyclostationarity Analysis as a Complement to Self-Supervised Representations for Speech Deepfake Detection

Cemal Hanilçi, Md Sahidullah, Tomi Kinnunen

TL;DR

This paper introduces a cyclostationarity-inspired acoustic feature extraction framework for speech deepfake detection based on spectral correlation density (SCD). Its results position cyclostationary signal analysis as a theoretically grounded and effective front end for the task.

Abstract

Speech deepfake detection (SDD) is essential for maintaining trust in voice-driven technologies and digital media. Although recent SDD systems increasingly rely on self-supervised learning (SSL) representations that capture rich contextual information, complementary signal-driven acoustic features remain important for modeling fine-grained structural properties of speech. Most existing acoustic front ends are based on time-frequency representations, which do not fully exploit higher-order spectral dependencies inherent in speech signals. We introduce a cyclostationarity-inspired acoustic feature extraction framework for SDD based on spectral correlation density (SCD). The proposed features model periodic statistical structures in speech by capturing spectral correlations between frequency components. In particular, we propose temporally structured SCD features that characterize the evolution of spectral and cyclic-frequency components over time. The effectiveness and complementarity of the proposed features are evaluated using multiple countermeasure architectures, including convolutional neural networks, SSL-based embedding systems, and hybrid fusion models. Experiments on ASVspoof 2019 LA, ASVspoof 2021 DF, and ASVspoof 5 demonstrate that SCD-based features provide complementary discriminative information to SSL embeddings and conventional acoustic representations. In particular, fusion of SSL and SCD embeddings reduces the equal error rate on ASVspoof 2019 LA from $8.28\%$ to $0.98\%$, and yields consistent improvements on the challenging ASVspoof 5 dataset. The results highlight cyclostationary signal analysis as a theoretically grounded and effective front end for speech deepfake detection.
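
To make "spectral correlations between frequency components" concrete, the following is a minimal sketch of one textbook SCD estimator, the time-smoothed cyclic cross-periodogram. It is not the authors' implementation. It correlates the short-time spectra $X(f+\alpha/2)$ and $X(f-\alpha/2)$ and averages over frames. The function name `estimate_scd`, the frame length, the hop size, the Hann window, the cyclic-frequency grid, and the test signal (white noise amplitude-modulating a 200 Hz carrier) are all assumptions chosen for illustration.

```python
import numpy as np

def estimate_scd(x, fs, alphas, nfft=256, hop=128):
    """Time-smoothed cyclic cross-periodogram estimate of SCD(f, alpha).

    Hypothetical helper: all parameter values are illustrative assumptions.
    """
    n = np.arange(len(x))
    window = np.hanning(nfft)
    starts = np.arange(0, len(x) - nfft + 1, hop)
    frames = starts[:, None] + np.arange(nfft)[None, :]  # (num_frames, nfft)
    scd = np.zeros((nfft, len(alphas)), dtype=complex)
    for j, alpha in enumerate(alphas):
        shift = np.exp(-1j * np.pi * alpha * n / fs)
        u = x * shift           # spectrum becomes X(f + alpha/2)
        v = x * np.conj(shift)  # spectrum becomes X(f - alpha/2)
        U = np.fft.fft(window * u[frames], axis=1)
        V = np.fft.fft(window * v[frames], axis=1)
        # Average the cross-periodogram over frames (time smoothing).
        scd[:, j] = np.mean(U * np.conj(V), axis=0)
    return scd / np.sum(window ** 2)

# Toy cyclostationary signal: white noise modulating a cosine carrier.
fs = 16000            # sampling rate in Hz (assumed)
fc = 200.0            # carrier frequency in Hz (assumed)
rng = np.random.default_rng(0)
t = np.arange(fs) / fs
x = rng.standard_normal(fs) * np.cos(2 * np.pi * fc * t)

alphas = np.arange(0.0, 1000.0, 40.0)       # cyclic frequencies in Hz
scd = estimate_scd(x, fs, alphas)
scd_alpha = np.mean(np.abs(scd), axis=0)    # magnitude averaged over f
scd_f = np.mean(np.abs(scd), axis=1)        # magnitude averaged over alpha

# Skip alpha = 0, which is the ordinary power spectrum.
peak_alpha = alphas[1:][np.argmax(scd_alpha[1:])]
print(peak_alpha)
```

For this toy signal the strongest non-zero cyclic frequency should sit at twice the carrier frequency, because the signal's variance is periodic at $2 f_c$. An STFT spectrogram of the same signal looks like nearly flat noise, which is the kind of structure a time-frequency front end does not expose. Averaging the magnitude over $f$ or $\alpha$ is one plausible way to form one-dimensional summaries; the paper's exact averaging may differ.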

Paper Structure (21 sections, 11 equations, 9 figures, 4 tables)

This paper contains 21 sections, 11 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Visualization of four modulated signals $x_1[n]$–$x_4[n]$ defined in Eqs. \ref{eq:ex1_1}–\ref{eq:ex1_4}. From top to bottom: time-domain waveforms (first 320 samples, corresponding to 20 ms), STFT spectrograms, two-dimensional SCD representations $\text{SCD}(f,\alpha)$, one-dimensional $\text{SCD}(\alpha)$ obtained by averaging $\text{SCD}(f,\alpha)$ over spectral frequency $f$, and one-dimensional $\text{SCD}(f)$ obtained by averaging over cyclic frequency $\alpha$. For clarity of visualization, the one-dimensional $\text{SCD}(\alpha)$ and $\text{SCD}(f)$ plots are shown over the frequency range $\alpha, f \in [0, 1000]$ Hz.
  • Figure 2: Comparison of bonafide and synthesized speech signals in the time and cyclostationary domains. Top row: Time-domain waveforms of the original (bonafide) speech signal and speech synthesized using sinusoidal modeling, LPC, WORLD, and neural vocoders. Middle row: Corresponding SCD representations $\text{SCD}(f,\alpha)$. Bottom row: SCD difference maps between the bonafide and synthesized signals, highlighting vocoder-induced artifacts. All synthesized signals were generated from the same bonafide utterance. For clarity of visualization, only the first second of each speech waveform is displayed, whereas the SCD representations are estimated from the entire utterance.
  • Figure 3: Estimation of conventional and proposed temporal SCD features ($\text{SCD}_a$, $\text{SCD}_b$) from speech signal $s[n]$. $^{*}$ denotes complex conjugation.
  • Figure 4: Visualization of time–frequency representations for a bonafide speech signal and its HiFi-GAN (v3) synthetic version. The first row shows the STFT spectrogram, the temporal SCD representation $\mathrm{SCD}_a(\alpha,t)$, and the spectral-domain representation $\mathrm{SCD}_b(f,t)$ for the bonafide signal. The second row shows the corresponding representations for the synthetic signal. The third row depicts relative difference maps between the bonafide and synthetic features.
  • Figure 5: Block diagrams of the four evaluated speech deepfake detection CMs: (a) SE-Res2Net50 CM using handcrafted acoustic features; (b) SSL–SE-Res2Net50 CM using frozen Wav2Vec 2.0 frame-level features; (c) SSL-only CM using utterance-level Wav2Vec 2.0 embeddings with lightweight projection layers; and (d) embedding fusion CM combining frozen SE-Res2Net50 and SSL embeddings via weighted sum (W. Sum), addition (Add), or concatenation (Concat.). Dashed blocks denote frozen modules; solid blocks denote trainable layers. The dimensionalities of intermediate representations are shown. The table reports the number of trainable parameters for each system (rounded to the nearest thousand), excluding the frozen SSL backbone.
  • ...and 4 more figures
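
The three embedding fusion options named in the Figure 5 caption (weighted sum, addition, concatenation) can be sketched as follows. This is only an illustration under stated assumptions: the embedding dimension, the example values, the single scalar weight `w`, and the convex form `w * a + (1 - w) * b` are hypothetical, and the paper's actual weighting scheme and dimensionalities are not given in this excerpt.

```python
import numpy as np

def fuse(e_ssl, e_scd, mode, w=0.5):
    """Combine two utterance-level embeddings (hypothetical helper)."""
    if mode == "wsum":
        # Assumed convex combination; in a trained system w would be learned.
        return w * e_ssl + (1.0 - w) * e_scd
    if mode == "add":
        return e_ssl + e_scd
    if mode == "concat":
        return np.concatenate([e_ssl, e_scd])
    raise ValueError(f"unknown fusion mode: {mode}")

# Toy 4-dimensional embeddings (values and size are illustrative only).
e_ssl = np.array([1.0, 2.0, 3.0, 4.0])
e_scd = np.array([0.0, 1.0, 0.0, 1.0])

fused_wsum = fuse(e_ssl, e_scd, "wsum", w=0.75)
fused_add = fuse(e_ssl, e_scd, "add")
fused_concat = fuse(e_ssl, e_scd, "concat")
print(fused_wsum.shape, fused_add.shape, fused_concat.shape)
```

Weighted sum and addition require both embeddings to share a dimension and keep the classifier input size fixed, while concatenation doubles the input size but lets the classifier weight each source per dimension.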