WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection

Xi Xuan, Xuechen Liu, Wenxin Zhang, Yi-Cheng Lin, Xiaojian Lin, Tomi Kinnunen

TL;DR

This work targets cross-domain generalization in speech deepfake detection by moving beyond full fine-tuning of large self-supervised learning (SSL) models. It introduces parameter-efficient front-ends that fuse prompt tuning with classical signal processing transforms: FourierPT-XLSR and the wavelet-based variants WSPT-XLSR and Partial-WSPT-XLSR. The WaveSP-Net architecture combines a Partial-WSPT-XLSR front-end with a bidirectional Mamba back-end and uses a learnable wavelet processing pipeline (Learnable Wavelet Decomposition, Wavelet Domain Sparsification, Learnable Wavelet Reconstruction) to enrich prompt embeddings while keeping the XLSR backbone frozen. Empirical results on Deepfake-Eval-2024 (DE24) and SpoofCeleb show state-of-the-art performance with only about 1.3% of the total parameters being trainable, highlighting the effectiveness of wavelet-domain prompts for detecting subtle synthetic artifacts in speech.
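The three-stage pipeline named above (decomposition, sparsification, reconstruction) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a fixed one-level Haar wavelet in place of learnable filters, assumes sparsification is a simple magnitude-based top-k applied to the detail coefficients only, and all function names and the `keep_ratio` parameter are hypothetical.

```python
import numpy as np

def haar_decompose(prompts):
    """One-level Haar DWT along the token axis.

    prompts: array of shape (num_tokens, dim) with an even num_tokens.
    Assumption: a fixed Haar filter stands in for the learnable decomposition.
    """
    even, odd = prompts[0::2], prompts[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-pass (coarse) coefficients
    detail = (even - odd) / np.sqrt(2.0)   # high-pass (fine) coefficients
    return approx, detail

def sparsify(coeffs, keep_ratio):
    """Zero all but the largest-magnitude coefficients (hypothetical top-k rule).

    Ties at the threshold are all kept, so slightly more than k may survive.
    """
    k = max(1, int(round(keep_ratio * coeffs.size)))
    threshold = np.sort(np.abs(coeffs), axis=None)[-k]
    return np.where(np.abs(coeffs) >= threshold, coeffs, 0.0)

def haar_reconstruct(approx, detail):
    """Inverse of haar_decompose: interleave the recovered even/odd tokens."""
    even = (approx + detail) / np.sqrt(2.0)
    odd = (approx - detail) / np.sqrt(2.0)
    out = np.empty((2 * approx.shape[0], approx.shape[1]), dtype=approx.dtype)
    out[0::2], out[1::2] = even, odd
    return out

def wavelet_sparse_prompts(prompts, keep_ratio=0.5):
    """Decompose, sparsify the detail band, and reconstruct the prompts."""
    approx, detail = haar_decompose(prompts)
    detail = sparsify(detail, keep_ratio)
    return haar_reconstruct(approx, detail)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prompts = rng.standard_normal((8, 4))   # 8 prompt tokens, 4 features each
    enriched = wavelet_sparse_prompts(prompts, keep_ratio=0.25)
    print(enriched.shape)
```

With `keep_ratio=1.0` nothing is discarded and the Haar transform reconstructs the input exactly, which is a quick sanity check on the decomposition and reconstruction pair. In a learnable version, the filter coefficients would be trainable parameters rather than the fixed Haar values used here.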

Abstract

Modern front-end design for speech deepfake detection relies on full fine-tuning of large pre-trained models like XLSR. However, this approach is not parameter-efficient and may lead to suboptimal generalization to realistic, in-the-wild data types. To address these limitations, we introduce a new family of parameter-efficient front-ends that fuse prompt-tuning with classical signal processing transforms. These include FourierPT-XLSR, which uses the Fourier Transform, and two variants based on the Wavelet Transform: WSPT-XLSR and Partial-WSPT-XLSR. We further propose WaveSP-Net, a novel architecture combining a Partial-WSPT-XLSR front-end and a bidirectional Mamba-based back-end. This design injects multi-resolution features into the prompt embeddings, which enhances the localization of subtle synthetic artifacts without altering the frozen XLSR parameters. Experimental results demonstrate that WaveSP-Net outperforms several state-of-the-art models on two new and challenging benchmarks, Deepfake-Eval-2024 and SpoofCeleb, with few trainable parameters and notable performance gains. The code and models are available at https://github.com/xxuan-acoustics/WaveSP-Net.
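The general idea of fusing prompt tuning with a Fourier Transform while leaving the backbone untouched can be sketched as follows. This is only an illustration under stated assumptions, not the FourierPT-XLSR design itself: it assumes prompts are prepended to a frozen layer's input sequence, that the transform is a 1D FFT along the prompt-token axis with the real part kept, and the function names and array sizes are hypothetical.

```python
import numpy as np

def fourier_prompts(prompts):
    """Real part of a 1D FFT taken along the prompt-token axis.

    Assumption: keeping only the real part is an illustrative choice.
    """
    return np.fft.fft(prompts, axis=0).real

def prepend_prompts(frozen_features, prompts):
    """Prepend transformed prompts to features from a frozen backbone layer.

    frozen_features: (num_frames, dim), produced by frozen weights.
    prompts:         (num_prompts, dim), the only trainable tensor here.
    """
    return np.concatenate([fourier_prompts(prompts), frozen_features], axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.standard_normal((50, 16))   # hypothetical frozen features
    prompts = rng.standard_normal((4, 16))     # hypothetical learnable prompts
    sequence = prepend_prompts(features, prompts)
    print(sequence.shape)
```

Because the backbone features pass through unchanged, only the prompt tensor would receive gradient updates during training, which is what keeps the trainable parameter count small.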

Paper Structure

This paper contains 14 sections, 2 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Overview of the WaveSP-Net architecture. The figure illustrates five different XLSR-based front-end variants: (a) PT-XLSR, (b) FourierPT-XLSR, (c) WPT-XLSR, (d) WSPT-XLSR and Partial-WSPT-XLSR. The proposed WaveSP-Net (rightmost panel) integrates a Partial-WSPT-XLSR front-end (bottom right) with a Mamba-based classifier (top right).
  • Figure 2: 2D t-SNE visualization of the Deepfake-Eval-2024 test set.