MASV: Speaker Verification with Global and Local Context Mamba
Yang Liu, Li Wan, Yiteng Huang, Ming Sun, Yangyang Shi, Florian Metze
TL;DR
The paper targets accurate, real-time speaker verification without excessive computation by combining state-space modeling with local and global context mechanisms. It introduces MASV, which integrates Mamba into the ECAPA-TDNN architecture through Local Context Bidirectional Mamba and Tri-Mamba blocks, together with full-scale skip connections that fuse local and global audio context and preserve feature flow. On a large private dataset, MASV lowers equal error rate (EER) and minimum detection cost function (minDCF) relative to ResNet and PCF-ECAPA baselines while matching or reducing FLOPs, and ablations show the contribution of each module. Because the Mamba/S4-based design scales linearly or near-linearly with sequence length, the approach suits edge devices and streaming applications.
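For readers unfamiliar with the EER metric cited above, the following is a minimal sketch of how it can be approximated from trial scores. The function name `equal_error_rate` and the toy scores are illustrative assumptions, and the coarse threshold sweep is not necessarily the scoring protocol used in the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by trying every observed score as a threshold.

    Hypothetical helper for illustration only; not the paper's scoring code.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (made up): one same-speaker trial overlaps the impostor range.
same_speaker = [0.9, 0.8, 0.7, 0.4]
different_speaker = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(same_speaker, different_speaker))
```

A lower EER means the system can find a threshold at which both false acceptances and false rejections are rare; minDCF instead weights the two error types by application-specific costs and priors.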
Abstract
Deep learning models such as convolutional neural networks (CNNs) and transformers have shown impressive capabilities in speaker verification and have gained considerable attention in the research community. However, CNN-based approaches struggle to model long audio sequences effectively, resulting in suboptimal verification performance, while transformer-based methods are often hindered by high computational demands that limit their practicality. This paper presents MASV, a novel architecture that integrates the Mamba module into the ECAPA-TDNN framework. By introducing the Local Context Bidirectional Mamba and Tri-Mamba blocks, the model captures both global and local context within audio sequences. Experimental results demonstrate that MASV substantially improves verification performance, surpassing existing models in both accuracy and efficiency.
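To give some intuition for why a state-space layer scales linearly with sequence length, and for what "bidirectional" means in this setting, here is a toy sketch. It is not the MASV or Mamba implementation: it uses a fixed scalar state with hypothetical parameters `a`, `b`, `c`, whereas Mamba uses input-dependent (selective) parameters and vector-valued states. Summing the two directions is likewise an assumption made for illustration.

```python
def ssm_scan(x, a=0.9, b=0.1, c=1.0):
    """Run h_t = a*h_{t-1} + b*x_t, y_t = c*h_t in a single pass.

    One update per frame, so the cost grows linearly with len(x).
    """
    h, y = 0.0, []
    for x_t in x:
        h = a * h + b * x_t   # the state carries a decaying summary of the past
        y.append(c * h)
    return y


def bidirectional_scan(x, **params):
    """Combine a forward scan with a scan over the time-reversed input."""
    forward = ssm_scan(x, **params)
    # Scan the reversed sequence, then flip back so outputs align per frame.
    backward = ssm_scan(x[::-1], **params)[::-1]
    # Assumption: directions are merged by addition; other fusions are possible.
    return [f + g for f, g in zip(forward, backward)]


# An impulse at the first frame: the forward pass spreads it to later frames,
# while the backward pass only sees it at the frame where it occurs.
print(bidirectional_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0))
# → [2.0, 0.5, 0.25]
```

Each output frame therefore depends on both earlier and later frames, which is the property a bidirectional block contributes, while the work per frame stays constant.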
