SE/BN Adapter: Parametric Efficient Domain Adaptation for Speaker Recognition

Tianhao Wang; Lantian Li; Dong Wang

TL;DR

The paper tackles domain mismatch in speaker recognition by proposing the SE/BN adapter, a lightweight scheme that freezes the backbone and tunes only the squeeze-and-excitation (SE) and batch-normalization (BN) components to adapt to a new domain. With VoxCeleb2 for pretraining and CN-Celeb genres for adaptation, the SE/BN adapter achieves strong gains over the frozen baseline and can approach or surpass full fine-tuning in low-resource settings, while tuning roughly $88.3K$ parameters (about 1% of the full model). Combining SE and BN captures both pattern weighting and distribution shifts, which makes rapid cross-domain adaptation practical with minimal data. Across experiments, the SE/BN approach is complementary to fine-tuning and optimizes stably, which suggests it can scale to deployment in diverse environments. Future work will extend the evaluation to other datasets and architectures and study distributional drift across domains.

Abstract

Deploying a well-optimized pre-trained speaker recognition model in a new domain often leads to a significant decline in performance. Fine-tuning is a commonly employed solution, but it demands ample adaptation data and is parameter-inefficient, which makes it impractical for real-world applications where little data is available for model adaptation. Drawing inspiration from the success of adapters in self-supervised pre-trained models, this paper introduces an SE/BN adapter to address this challenge. The adapter freezes the core speaker encoder and adjusts the feature maps' weights and activation distributions through trainable squeeze-and-excitation (SE) blocks and batch normalization (BN) layers. Our experiments, conducted using VoxCeleb for pre-training and 4 genres from CN-Celeb for adaptation, demonstrate that the SE/BN adapter offers significant performance improvement over the baseline and competes with the vanilla fine-tuning approach while tuning just 1% of the parameters.
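
The selection rule described in the abstract (freeze the encoder, train only SE and BN components) can be illustrated with a minimal sketch. This is not the authors' implementation: the parameter names, the sizes, and the name-matching rule in `is_adapter_param` are all hypothetical and chosen only to show how a small trainable subset is carved out of a larger model.

```python
# Toy parameter inventory: (name, number_of_values).
# ASSUMPTION: every name and size below is hypothetical, not taken from the paper.
TOY_PARAMS = [
    ("layer1.conv1.weight", 9216),
    ("layer1.bn1.weight", 32),
    ("layer1.bn1.bias", 32),
    ("layer1.se.fc1.weight", 256),
    ("layer1.se.fc2.weight", 256),
    ("embedding.weight", 4096),
]


def is_adapter_param(name):
    """Return True if the parameter belongs to an SE block or a BN layer.

    ASSUMPTION: SE modules are named "se" and BN modules start with "bn".
    """
    return any(part == "se" or part.startswith("bn") for part in name.split("."))


def split_params(params):
    """Split the inventory into (trainable, frozen) lists."""
    trainable = [(n, s) for n, s in params if is_adapter_param(n)]
    frozen = [(n, s) for n, s in params if not is_adapter_param(n)]
    return trainable, frozen


def trainable_fraction(params):
    """Fraction of all parameter values that would be updated during adaptation."""
    trainable, _ = split_params(params)
    return sum(s for _, s in trainable) / sum(s for _, s in params)


trainable, frozen = split_params(TOY_PARAMS)
print("trainable:", [n for n, _ in trainable])
print("frozen:   ", [n for n, _ in frozen])
print("fraction: ", round(trainable_fraction(TOY_PARAMS), 4))
```

In a deep learning framework, the same split would typically be applied by disabling gradient updates for every parameter in the frozen list. The fraction printed here reflects only the toy inventory and says nothing about the 1% figure reported in the paper.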

Paper Structure (12 sections, 4 equations, 2 figures, 5 tables)

Introduction
Related Work
SE/BN Adapter
  Revisit SE
  SE layer is domain specific
  SE/BN Adapter
Experiments
  Data
  Settings
  Basic Results with Adapters
  Results with Limited Data
Conclusion

Figures (2)

Figure 1: Illustration of the Squeeze-and-Excitation (SE) block, showing the excitation function $f=\sigma (\mathbf{W}_2 \delta (\mathbf{W}_1 \mathbf{z}))$ and its weight matrices $\mathbf{W}_1$ and $\mathbf{W}_2$.
Figure 2: The SE/BN adapter. (a) A ResNet block with SE blocks and BN layers. (b) BN layer. Yellow indicates trainable parameters.
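
The excitation function in the Figure 1 caption can be sketched in a few lines of plain Python. The sketch assumes the standard SE formulation, in which $\mathbf{z}$ is obtained by global average pooling over each channel, $\delta$ is ReLU, and $\sigma$ is the sigmoid; the caption itself does not spell these out. The function names and the list-of-lists data layout are illustrative choices, not the paper's code.

```python
import math


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def squeeze(feature_maps):
    """Squeeze step: one scalar z_c per channel (ASSUMED global average pooling)."""
    return [sum(channel) / len(channel) for channel in feature_maps]


def excitation(z, W1, W2):
    """f = sigmoid(W2 * relu(W1 * z)), following the caption's formula (no biases)."""
    return sigmoid(matvec(W2, relu(matvec(W1, z))))


def se_block(feature_maps, W1, W2):
    """Rescale every channel by its excitation weight."""
    weights = excitation(squeeze(feature_maps), W1, W2)
    return [[w * x for x in channel] for w, channel in zip(weights, feature_maps)]


# Two channels, each flattened to two values; W1 reduces 2 -> 1, W2 expands 1 -> 2.
feature_maps = [[1.0, 3.0], [2.0, 2.0]]
W1 = [[0.0, 0.0]]
W2 = [[0.0], [0.0]]
print(se_block(feature_maps, W1, W2))
```

With all-zero weights the sigmoid outputs 0.5 for every channel, so each feature map is simply halved; with learned weights, the per-channel scaling depends on the pooled statistics of the input.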
