An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification

Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, Jiajun Qi

TL;DR

This paper introduces Enhanced Res2Net (ERes2Net), which fuses local and global multi-scale features for speaker verification using attentional feature fusion (AFF) in both local and global fusion modules. The Local Feature Fusion (LFF) enhances fine-grained local interactions within residual blocks, while Global Feature Fusion (GFF) aggregates multi-scale information across stages in a bottom-up pathway. Experiments on VoxCeleb show that LFF, GFF, and their combination improve verification accuracy with fewer parameters, and the approach achieves state-of-the-art results on VoxCeleb1-O with EER as low as ~0.83% and MinDCF ~0.072; code is publicly available. The work highlights the benefits of explicit local/global fusion for robust, multi-scale speaker representations.
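
For readers unfamiliar with the EER metric quoted above, the following is a minimal sketch of how an equal error rate can be estimated from trial scores. It is a plain threshold sweep written for illustration, not the scoring tool used by the authors; the function name `equal_error_rate` and the toy scores are assumptions introduced here.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER with a simple threshold sweep (illustrative only).

    target_scores: similarity scores for same-speaker trials.
    nontarget_scores: similarity scores for different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")

    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))

    best_gap = float("inf")
    eer = None
    for t in thresholds:
        # False rejection rate: same-speaker trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: different-speaker trials at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of FAR and FRR where they are closest.
            best_gap = gap
            eer = (far + frr) / 2.0
    return eer


if __name__ == "__main__":
    # Hypothetical toy scores, not results from the paper.
    targets = [0.9, 0.8, 0.7, 0.3]
    nontargets = [0.6, 0.4, 0.2, 0.1]
    print(f"EER = {equal_error_rate(targets, nontargets):.2%}")
```

Production toolkits usually interpolate between operating points rather than picking the closest one, so values computed this way on small trial lists are only approximate.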

Abstract

Effective fusion of multi-scale features is crucial for improving speaker verification performance. However, most existing methods aggregate multi-scale features in a layer-wise manner via simple operations such as summation or concatenation. This paper proposes a novel architecture called Enhanced Res2Net (ERes2Net), which incorporates both local and global feature fusion techniques to improve performance. The local feature fusion (LFF) fuses the features within a single residual block to extract the local signal. The global feature fusion (GFF) takes acoustic features of different scales as input to aggregate the global signal. To facilitate effective feature fusion in both LFF and GFF, an attentional feature fusion module is employed in the ERes2Net architecture, replacing summation or concatenation operations. A range of experiments conducted on the VoxCeleb datasets demonstrates the superiority of ERes2Net in speaker verification. Code has been made publicly available at https://github.com/alibaba-damo-academy/3D-Speaker.
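
To make the contrast between plain summation and attention-based fusion concrete, here is a minimal sketch of a gated fusion of two feature vectors. It is a deliberately simplified stand-in, assuming a single linear layer followed by a sigmoid gate; it is not the AFF module defined in the paper or in the 3D-Speaker repository, and the names `attentional_fusion`, `weights`, and `bias` are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def attentional_fusion(x, y, weights, bias):
    """Fuse two equal-length feature vectors with per-channel attention.

    x, y: lists of C floats (two feature branches to be fused).
    weights: C rows of 2*C floats, applied to the concatenation [x; y].
    bias: list of C floats.
    Returns a list of C floats: a * x + (1 - a) * y per channel, where the
    gate a is computed from both inputs instead of being fixed.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of channels")
    concat = x + y  # list concatenation: [x; y]
    fused = []
    for c in range(len(x)):
        logit = sum(w * v for w, v in zip(weights[c], concat)) + bias[c]
        a = sigmoid(logit)  # attention weight in (0, 1) for channel c
        fused.append(a * x[c] + (1.0 - a) * y[c])
    return fused


if __name__ == "__main__":
    random.seed(0)
    channels = 4
    x = [random.uniform(-1, 1) for _ in range(channels)]
    y = [random.uniform(-1, 1) for _ in range(channels)]
    # Hypothetical, randomly initialised parameters; a real model learns them.
    weights = [[random.gauss(0, 0.5) for _ in range(2 * channels)]
               for _ in range(channels)]
    bias = [0.0] * channels
    print("summation:", [a + b for a, b in zip(x, y)])
    print("attentional:", attentional_fusion(x, y, weights, bias))
```

With all weights and biases set to zero the gate is 0.5 everywhere and the result reduces to a scaled summation, which shows how input-dependent weights generalise the fixed operations they replace.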

Paper Structure

This paper contains 14 sections, 4 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Overview of the enhanced Res2Net framework.
  • Figure 2: Illustration of different structures in the modules: (a) Res2Net block; (b) ERes2Net block; (c) Attentional feature fusion (AFF) module; (d) Global feature fusion (GFF) module.