RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection

Yujie Chen, Jiangyan Yi, Jun Xue, Chenglong Wang, Xiaohui Zhang, Shunbo Dong, Siding Zeng, Jianhua Tao, Lv Zhao, Cunhang Fan

TL;DR

This work tackles audio deepfake detection, a task that depends on both short-range artefacts and long-range contextual cues. It introduces RawBMamba, an end-to-end bidirectional state-space model that uses a sinc-based front-end to extract short-range features and dual forward/backward Mamba streams to model long-range dependencies, with a bidirectional fusion module that integrates the two streams. Results on ASVspoof benchmarks show substantial improvements over strong baselines, notably a 34.1% gain over SE-Rawformer on ASVspoof 2021 LA and robust performance on out-of-domain data such as 21DF. The findings support the potential of bidirectional state-space architectures for reliable, generalizable audio deepfake detection and suggest practical applicability in real-world systems.

Abstract

Artefacts that distinguish fake audio from bonafide audio can exist in both short- and long-range segments. Therefore, combining local and global feature information can effectively discriminate between bonafide and fake audio. This paper proposes an end-to-end bidirectional state space model, named RawBMamba, to capture both short- and long-range discriminative information for audio deepfake detection. Specifically, we use a sinc layer and multiple convolutional layers to capture short-range features, and then design a bidirectional Mamba to address Mamba's unidirectional modelling problem and further capture long-range feature information. Moreover, we develop a bidirectional fusion module to integrate embeddings, enhancing audio context representation and combining short- and long-range information. The results show that our proposed RawBMamba achieves a 34.1% improvement over Rawformer on the ASVspoof2021 LA dataset, and demonstrates competitive performance on other datasets.
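
The bidirectional idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps Mamba's selective state-space block for a plain diagonal linear recurrence, shares one set of parameters between the two directions, and fuses the streams by simple concatenation. The function names `ssm_scan` and `bidirectional_scan` and the parameters `a`, `b`, `c` are hypothetical and chosen only for illustration.

```python
import numpy as np


def ssm_scan(x, a, b, c):
    """Run a diagonal linear state-space recurrence over time.

    x: (T, D) input sequence; a, b, c: (D,) per-channel parameters.
    Recurrence: h_t = a * h_{t-1} + b * x_t, output y_t = c * h_t.
    This is a simplified stand-in for a Mamba block (assumption).
    """
    h = np.zeros(x.shape[1])
    y = np.empty_like(x, dtype=float)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        y[t] = c * h
    return y


def bidirectional_scan(x, a, b, c):
    """Scan the sequence in both directions and concatenate the results.

    Parameters are shared across directions for brevity (assumption);
    concatenation is a placeholder for a learned fusion module.
    """
    fwd = ssm_scan(x, a, b, c)
    # Reverse time, scan, then restore the original time order.
    bwd = ssm_scan(x[::-1], a, b, c)[::-1]
    return np.concatenate([fwd, bwd], axis=-1)  # shape (T, 2 * D)


if __name__ == "__main__":
    # A single impulse in the middle of a short sequence.
    x = np.zeros((5, 1))
    x[2, 0] = 1.0
    out = bidirectional_scan(x, np.array([0.5]), np.array([1.0]), np.array([1.0]))
    print(out)
```

With an impulse at the middle frame, the forward stream responds only at and after that frame, while the backward stream responds only at and before it. A unidirectional scan therefore never sees future context at a given frame, which is the limitation the bidirectional design is meant to address.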

Paper Structure

This paper contains 14 sections, 6 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: The overall structure diagram of our proposed RawBMamba.
  • Figure 2: 2D t-SNE plots of 19LA test set samples, using embeddings from the model's higher layers. Panels (a) and (b) show the clustering of bonafide vs. fake audio (blue for bonafide, red for fake); panels (c) and (d) show the different attack types, with each color indicating a specific attack.