A Parameter-Efficient Multi-Scale Convolutional Adapter for Synthetic Speech Detection

Yassine El Kheir; Fabian Ritter-Guttierez; Arnab Das; Tim Polzehl; Sebastian Möller

TL;DR

This work tackles the challenge of efficiently adapting pre-trained SSL models for synthetic speech detection, where full fine-tuning is costly and prone to overfitting. It introduces MultiConvAdapter, a parameter-efficient module that adds a multi-scale temporal inductive bias via parallel depthwise convolutions with kernel sizes $\{3, 7, 15, 23\}$. The module is inserted after the MHSA in each Transformer layer and is coupled with a channel-down projection to $D'=64$ and Mixup Conv fusion. With only $3.17$M trainable parameters ($1\%$ of the backbone) and a best average EER of $5.91\%$ across five public datasets, the method outperforms full fine-tuning and existing PEFT approaches, demonstrating strong cross-dataset and cross-backbone generalization. Ablation studies confirm the importance of Mixup Conv fusion and post-MHSA placement, and the approach remains effective across different SSL backbones (XLSR, HuBERT, WavLM) and classifiers (AASIST, Nes2Net, BiCrossMamba-ST), offering a practical and scalable solution for robust anti-spoofing in real-world deployments.
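The data flow described above (down-projection, parallel depthwise convolutions at several kernel sizes, fusion, up-projection back into the Transformer stream) can be illustrated with a shape-level NumPy sketch. This is not the authors' implementation. The backbone width `D_MODEL = 1024`, the pointwise linear map standing in for Mixup Conv fusion, the ReLU, the residual connection, and the zero-initialised up-projection are all assumptions made for illustration; only the kernel sizes and $D'=64$ come from the summary.

```python
import numpy as np

KERNELS = (3, 7, 15, 23)  # kernel sizes quoted in the TL;DR
D_MODEL = 1024            # ASSUMED backbone width (not stated in the summary)
D_DOWN = 64               # D' = 64 from the TL;DR


def depthwise_conv1d(x, w):
    """Per-channel 1-D filtering with 'same' zero padding.

    x: (T, C) sequence; w: (K, C), one length-K filter per channel, K odd.
    Uses cross-correlation (no kernel flip), as is usual in deep learning.
    """
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    # windows has shape (T, C, K): each frame sees its K neighbouring frames.
    windows = np.lib.stride_tricks.sliding_window_view(xp, k, axis=0)
    return np.einsum("tck,kc->tc", windows, w)


def init_adapter(rng, d_model=D_MODEL, d_down=D_DOWN, kernels=KERNELS):
    """Hypothetical parameter layout for one adapter instance."""
    return {
        "w_down": rng.normal(0.0, 0.02, (d_model, d_down)),
        "b_down": np.zeros(d_down),
        "dw": [rng.normal(0.0, 0.02, (k, d_down)) for k in kernels],
        # ASSUMPTION: fusion modelled as a pointwise linear mix of all branches.
        "w_fuse": rng.normal(0.0, 0.02, (len(kernels) * d_down, d_down)),
        # ASSUMPTION: zero init, so the adapter starts as an identity mapping.
        "w_up": np.zeros((d_down, d_model)),
        "b_up": np.zeros(d_model),
    }


def multi_conv_adapter(h, p):
    """h: (T, d_model) hidden states after MHSA; returns the same shape."""
    z = h @ p["w_down"] + p["b_down"]                        # (T, D')
    branches = [depthwise_conv1d(z, w) for w in p["dw"]]     # one (T, D') per kernel
    fused = np.concatenate(branches, axis=-1) @ p["w_fuse"]  # (T, D')
    fused = np.maximum(fused, 0.0)                           # ReLU (assumed)
    return h + fused @ p["w_up"] + p["b_up"]                 # residual add (assumed)
```

Each branch sees the same down-projected sequence but with a different temporal receptive field, which is the sense in which the adapter is multi-scale: the size-3 filter responds to frame-level artifacts, while the size-23 filter spans a much longer context.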

Abstract

Recent synthetic speech detection models typically adapt a pre-trained SSL model via fine-tuning, which is computationally demanding. Parameter-Efficient Fine-Tuning (PEFT) offers an alternative. However, existing methods lack the specific inductive biases required to model the multi-scale temporal artifacts characteristic of spoofed audio. This paper introduces the Multi-Scale Convolutional Adapter (MultiConvAdapter), a parameter-efficient architecture designed to address this limitation. MultiConvAdapter integrates parallel convolutional modules within the SSL encoder, facilitating the simultaneous learning of discriminative features across multiple temporal resolutions, capturing both short-term artifacts and long-term distortions. With only $3.17$M trainable parameters ($1\%$ of the SSL backbone), MultiConvAdapter substantially reduces the computational burden of adaptation. Evaluations on five public datasets demonstrate that MultiConvAdapter achieves superior performance compared to full fine-tuning and established PEFT methods.
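To give a feel for where a budget of roughly $3.17$M parameters could come from, the back-of-the-envelope count below tallies the down/up projections and depthwise filters of one adapter per Transformer layer. The backbone shape (24 layers, width 1024, matching the public XLS-R 300M configuration) and the function `adapter_param_budget` are assumptions for illustration; the text does not state how the authors count parameters, and the fusion layer is left out because its form is not specified here.

```python
def adapter_param_budget(d_model, d_down, kernels, n_layers):
    """Rough trainable-parameter count for one adapter per Transformer layer.

    Counts weights and biases of the down/up projections and of one
    depthwise filter bank per kernel size. Fusion parameters are NOT counted.
    """
    down = d_model * d_down + d_down   # D -> D' projection
    up = d_down * d_model + d_model    # D' -> D projection
    depthwise = sum(k * d_down + d_down for k in kernels)
    per_layer = {"projections": down + up, "depthwise": depthwise}
    return {name: count * n_layers for name, count in per_layer.items()}


# ASSUMED backbone shape: 24 layers of width 1024 (not given in the abstract).
budget = adapter_param_budget(
    d_model=1024, d_down=64, kernels=(3, 7, 15, 23), n_layers=24
)
print(budget)  # {'projections': 3171840, 'depthwise': 79872}
```

Under these assumptions the projections dominate the count, and the multi-scale depthwise filters add comparatively little, which is consistent with the claim that the temporal inductive bias comes at a small parameter cost.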
