A Parameter-Efficient Multi-Scale Convolutional Adapter for Synthetic Speech Detection

Yassine El Kheir, Fabian Ritter-Guttierez, Arnab Das, Tim Polzehl, Sebastian Möller

TL;DR

This work addresses the efficient adaptation of pre-trained self-supervised learning (SSL) models for synthetic speech detection, where full fine-tuning is costly and prone to overfitting. It introduces MultiConvAdapter, a parameter-efficient module that adds a multi-scale temporal inductive bias through parallel depthwise convolutions with kernel sizes $\{3,7,15,23\}$. The module is inserted after the multi-head self-attention (MHSA) block in each Transformer layer and is combined with a channel down-projection to $D'=64$ and Mixup Conv fusion. With only $3.17$M trainable parameters ($1\%$ of the backbone) and a best average EER of $5.91\%$ across five public datasets, the method outperforms full fine-tuning and existing PEFT approaches and generalizes well across datasets and backbones. Ablation studies confirm the importance of Mixup Conv fusion and post-MHSA placement. The approach remains effective across different SSL backbones (XLSR, HuBERT, WavLM) and classifiers (AASIST, Nes2Net, BiCrossMamba-ST), making it a practical and scalable option for robust anti-spoofing in real-world deployments.
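To give a feel for where the trainable weights in such an adapter sit, the following is a rough back-of-the-envelope count, not the authors' accounting. The kernel sizes and $D'=64$ come from the summary above; the hidden size `D_MODEL = 1024`, the layer count `NUM_LAYERS = 24`, the presence of an up-projection, and the use of biases are all assumptions made for illustration. The Mixup Conv fusion and the classifier are left out, so the result should not be read as a reconstruction of the reported $3.17$M figure.

```python
# Hypothetical parameter count for one bottleneck adapter with parallel
# depthwise 1-D convolutions. Every layer shape here is an assumption.

def adapter_params(d_model, d_bottleneck, kernels, bias=True):
    """Count weights in: down-projection, depthwise convs, up-projection."""
    down = d_model * d_bottleneck + (d_bottleneck if bias else 0)
    # A depthwise convolution holds one length-k filter per channel.
    convs = sum(d_bottleneck * k + (d_bottleneck if bias else 0) for k in kernels)
    up = d_bottleneck * d_model + (d_model if bias else 0)
    return down, convs, up

KERNELS = (3, 7, 15, 23)  # kernel sizes stated in the summary
D_BOTTLENECK = 64         # D' stated in the summary
D_MODEL = 1024            # assumed backbone hidden size
NUM_LAYERS = 24           # assumed number of Transformer layers

down, convs, up = adapter_params(D_MODEL, D_BOTTLENECK, KERNELS)
per_layer = down + convs + up
total = per_layer * NUM_LAYERS
print(f"projections per layer: {down + up}")
print(f"depthwise convs per layer: {convs}")
print(f"all layers: {total}")
```

Under these assumptions the two projections account for nearly all of the weights, while the four depthwise convolutions add only a few thousand parameters per layer. This is why widening the set of kernel sizes is cheap compared with widening the bottleneck.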

Abstract

Recent synthetic speech detection models typically adapt a pre-trained SSL model via fine-tuning, which is computationally demanding. Parameter-Efficient Fine-Tuning (PEFT) offers an alternative. However, existing methods lack the specific inductive biases required to model the multi-scale temporal artifacts characteristic of spoofed audio. This paper introduces the Multi-Scale Convolutional Adapter (MultiConvAdapter), a parameter-efficient architecture designed to address this limitation. MultiConvAdapter integrates parallel convolutional modules within the SSL encoder, so that discriminative features are learned simultaneously at multiple temporal resolutions, capturing both short-term artifacts and long-term distortions. With only $3.17$M trainable parameters ($1\%$ of the SSL backbone), MultiConvAdapter substantially reduces the computational burden of adaptation. Evaluations on five public datasets demonstrate that MultiConvAdapter achieves superior performance compared to full fine-tuning and established PEFT methods.
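The idea of parallel convolutions at several temporal resolutions can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, the filters are fixed moving averages standing in for learned weights, and a plain average of the branches stands in for the fusion step, whose details are not given here. The projections and any nonlinearity are omitted.

```python
def depthwise_conv1d(x, filters):
    """Depthwise 1-D convolution with zero padding ('same' output length).

    x: list of T frames, each a list of C floats.
    filters: list of C filters, each a list of k floats (k odd).
    Each channel is filtered independently with its own filter.
    """
    T, C = len(x), len(x[0])
    k = len(filters[0])
    half = k // 2
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            acc = 0.0
            for j in range(k):
                s = t + j - half          # input frame under filter tap j
                if 0 <= s < T:            # frames outside the signal count as zero
                    acc += filters[c][j] * x[s][c]
            out[t][c] = acc
    return out


def multi_scale_branches(x, kernels=(3, 7, 15, 23)):
    """Run one depthwise branch per kernel size and average the branches."""
    T, C = len(x), len(x[0])
    branches = []
    for k in kernels:
        # Placeholder filters: a moving average of width k per channel.
        filters = [[1.0 / k] * k for _ in range(C)]
        branches.append(depthwise_conv1d(x, filters))
    # Placeholder fusion: element-wise mean over branches.
    return [[sum(b[t][c] for b in branches) / len(branches) for c in range(C)]
            for t in range(T)]


# Toy input: 50 frames, 2 channels, all ones.
frames = [[1.0, 1.0] for _ in range(50)]
fused = multi_scale_branches(frames)
print(len(fused), len(fused[0]))
```

Short kernels respond to local, frame-level changes, while long kernels aggregate over a wider context; running them in parallel on the same features lets both views contribute to each output frame.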

Paper Structure

This paper contains 15 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Trainable parameters vs EER% trade-off for PEFT methods.
  • Figure 2: The proposed MultiConvAdapter, a multi-scale convolutional adapter with kernel sizes $\{k_{1}, k_{2}, k_{3}, k_{4}\}$ for efficient synthetic speech detection.