BiCrossMamba-ST: Speech Deepfake Detection with Bidirectional Mamba Spectro-Temporal Cross-Attention
Yassine El Kheir, Tim Polzehl, Sebastian Möller
TL;DR
This work tackles the challenge of detecting speech deepfakes that threaten Automatic Speaker Verification systems. It introduces BiCrossMamba-ST, a dual-branch spectro-temporal architecture built on bidirectional Mamba blocks with mutual cross-attention and a learned 2D attention map that localizes artifacts in both the spectral and temporal domains. The encoder uses a RawNet2-based front end, and the two branches are fused via mutual cross-attention and sequence pooling to yield a final detection score. Across the ASVspoof LA19, LA21, DF21, and ASVspoof 5 benchmarks, BiCrossMamba-ST achieves strong, generalizable performance with a lighter parameter footprint than competing end-to-end models, supporting its effectiveness for robust speech anti-spoofing.
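The fusion step described above (mutual cross-attention followed by pooling) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names are hypothetical, learned query/key/value projections and the Mamba blocks are omitted, and plain mean pooling stands in for the paper's sequence pooling.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention: each query token attends over all context tokens.

    Assumption: no learned projections; tokens are used directly as Q, K and V.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ci for qi, ci in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        attended.append([sum(w * c[j] for w, c in zip(weights, context)) for j in range(dim)])
    return attended


def mean_pool(tokens):
    """Average a token sequence into one vector (stand-in for sequence pooling)."""
    return [sum(column) / len(tokens) for column in zip(*tokens)]


def mutual_cross_attention_fuse(spectral, temporal):
    """Let each branch attend to the other, pool both, and concatenate the results."""
    spectral_ctx = cross_attend(spectral, temporal)  # spectral queries, temporal context
    temporal_ctx = cross_attend(temporal, spectral)  # temporal queries, spectral context
    return mean_pool(spectral_ctx) + mean_pool(temporal_ctx)


# Toy inputs: 3 spectral tokens and 2 temporal tokens, each of dimension 2.
spectral = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
temporal = [[0.5, 0.5], [1.0, -1.0]]
fused = mutual_cross_attention_fuse(spectral, temporal)
```

Here `fused` has twice the token dimension because the two pooled branch vectors are concatenated. In a full model, a vector like this would typically feed a small classifier head that produces the detection score.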
Abstract
We propose BiCrossMamba-ST, a robust framework for speech deepfake detection that leverages a dual-branch spectro-temporal architecture powered by bidirectional Mamba blocks and mutual cross-attention. By processing spectral sub-bands and temporal intervals separately and then integrating their representations, BiCrossMamba-ST effectively captures the subtle cues of synthetic speech. In addition, the framework uses a convolution-based 2D attention map to focus on specific spectro-temporal regions, enabling robust deepfake detection. Operating directly on raw features, BiCrossMamba-ST achieves significant performance improvements: relative gains of 67.74% and 26.3% over the state-of-the-art AASIST on the ASVspoof LA21 and ASVspoof DF21 benchmarks, respectively, and a 6.80% improvement over RawBMamba on ASVspoof DF21. Code and models will be made publicly available.
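For readers interpreting the relative gains quoted above, the usual convention is sketched below. This assumes the gains are computed on an error metric such as the equal error rate; the abstract itself does not name the metric, and the symbols $E_{\text{baseline}}$ and $E_{\text{proposed}}$ are introduced here for illustration only.

\[
\text{relative gain} = \frac{E_{\text{baseline}} - E_{\text{proposed}}}{E_{\text{baseline}}} \times 100\%
\]

Under this convention, a system that halves the baseline's error rate has a 50% relative gain, regardless of the absolute error values.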
