Neighborhood Attention Transformer with Progressive Channel Fusion for Speaker Verification
Nian Li, Jianguo Wei
TL;DR
This work addresses the data-efficiency problem of Transformer-based speaker verification by introducing PCF-NAT, a Neighborhood Attention Transformer augmented with Progressive Channel Fusion. The model alternates local neighborhood attention with global attention, aggregates features from multiple levels before attentive statistics pooling, and progressively expands the receptive field in the channel dimension through 1D group convolutions. Trained on VoxCeleb2 and evaluated on VoxCeleb1 and VoxSRC, PCF-NAT achieves competitive EER and minDCF with reduced memory usage and scales to greater depth, with deeper configurations reaching an EER below 0.5% on VoxCeleb1-O. The approach offers a path toward scalable, data-efficient Transformer-based automatic speaker verification (ASV), with potential applicability to downstream tasks such as speech synthesis and voice conversion, and it points to improved down-sampling and larger-scale training as directions for future work.
Abstract
Transformer-based architectures for speaker verification typically require more training data than ECAPA-TDNN. As a result, recent work has generally trained on VoxCeleb1 and VoxCeleb2 combined. We propose a backbone network based on self-attention that achieves competitive results when trained on VoxCeleb2 alone. The network alternates between neighborhood attention and global attention to capture local and global features, then aggregates features from different hierarchical levels, and finally performs attentive statistics pooling. Additionally, we employ a progressive channel fusion strategy that expands the receptive field in the channel dimension as the network deepens. We trained the proposed PCF-NAT model on VoxCeleb2 and evaluated it on VoxCeleb1 and the validation sets of VoxSRC. The EER and minDCF of the shallow PCF-NAT are on average more than 20% lower than those of a similarly sized ECAPA-TDNN. The deep PCF-NAT achieves an EER below 0.5% on VoxCeleb1-O. The code and models are publicly available at https://github.com/ChenNan1996/PCF-NAT.
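The alternation between neighborhood attention and global attention can be illustrated with a minimal sketch. This is not the released PCF-NAT implementation: it is a dependency-free toy in which queries, keys, and values are all the raw input frames (no learned projections, single head), the neighborhood is simply truncated at the sequence edges, and the layer order (neighborhood first, then global) is an assumption. The names `attention_1d`, `alternate_blocks`, and `window` are hypothetical and chosen only for illustration.

```python
import math


def attention_1d(x, window=None):
    """Toy single-head self-attention over a list of frame vectors.

    window=None  -> global attention: every frame attends to all frames.
    window=k     -> neighborhood attention: frame i attends only to frames
                    i-k .. i+k (truncated at the edges; a simplification).
    Assumption: queries = keys = values = x, with no learned projections.
    """
    num_frames, dim = len(x), len(x[0])
    out = []
    for i in range(num_frames):
        if window is None:
            lo, hi = 0, num_frames
        else:
            lo, hi = max(0, i - window), min(num_frames, i + window + 1)
        # Scaled dot-product scores against the frames in the attended span.
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(dim)
            for j in range(lo, hi)
        ]
        # Numerically stable softmax over the span.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Weighted average of the attended frames.
        out.append([
            sum(weights[k] / total * x[lo + k][c] for k in range(len(weights)))
            for c in range(dim)
        ])
    return out


def alternate_blocks(x, depth, window):
    """Alternate neighborhood and global attention (assumed order: local first)."""
    for layer in range(depth):
        x = attention_1d(x, window if layer % 2 == 0 else None)
    return x


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
mixed = alternate_blocks(frames, depth=4, window=1)
```

In this sketch the cost of a neighborhood layer grows with the number of frames times the window size, whereas a global layer grows with the square of the number of frames, which is one intuition for why interleaving the two can reduce memory use.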
