ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification

Brecht Desplanques, Jenthe Thienpondt, Kris Demuynck

TL;DR

The paper addresses limitations of TDNN-based speaker verification by introducing ECAPA-TDNN, which combines 1-D SE-Res2Blocks, channel- and context-dependent statistics pooling, and multi-layer feature aggregation with summed residual connections. The architecture models channel interdependencies explicitly, expands the temporal context of the frame layers, and fuses features from different hierarchical levels to produce more discriminative embeddings. Evaluations on VoxCeleb1 and VoxSRC 2019 show substantial improvements over strong state-of-the-art TDNN baselines at competitive parameter counts.

Abstract

Current speaker verification techniques rely on a neural network to extract speaker representations. The successful x-vector architecture is a Time Delay Neural Network (TDNN) that applies statistics pooling to project variable-length utterances into fixed-length speaker characterizing embeddings. In this paper, we propose multiple enhancements to this architecture based on recent trends in the related fields of face verification and computer vision. Firstly, the initial frame layers can be restructured into 1-dimensional Res2Net modules with impactful skip connections. Similarly to SE-ResNet, we introduce Squeeze-and-Excitation blocks in these modules to explicitly model channel interdependencies. The SE block expands the temporal context of the frame layer by rescaling the channels according to global properties of the recording. Secondly, neural networks are known to learn hierarchical features, with each layer operating on a different level of complexity. To leverage this complementary information, we aggregate and propagate features of different hierarchical levels. Finally, we improve the statistics pooling module with channel-dependent frame attention. This enables the network to focus on different subsets of frames during each of the channel's statistics estimation. The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art TDNN based systems on the VoxCeleb test sets and the 2019 VoxCeleb Speaker Recognition Challenge.
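The channel-dependent frame attention described in the abstract can be illustrated with a small sketch. It assumes the attention scores are already available as one score per channel and frame; the small network that would produce them is omitted. The function names `softmax` and `attentive_stats_pooling` are hypothetical, and this is a plain-Python illustration of the idea rather than the authors' implementation.

```python
import math


def softmax(scores):
    """Normalize one channel's frame scores into weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pooling(features, scores):
    """Pool C channels of T frames into 2*C values (means, then stds).

    features[c][t] is the activation of channel c at frame t.
    scores[c][t] is an assumed, precomputed attention score for that
    channel and frame. Because each channel has its own scores, each
    channel can emphasize a different subset of frames.
    """
    means, stds = [], []
    for h_c, e_c in zip(features, scores):
        alpha = softmax(e_c)  # weights over time, for this channel only
        mu = sum(a * h for a, h in zip(alpha, h_c))
        second_moment = sum(a * h * h for a, h in zip(alpha, h_c))
        var = max(second_moment - mu * mu, 0.0)  # clamp rounding error
        means.append(mu)
        stds.append(math.sqrt(var))
    return means + stds


features = [[1.0, 2.0, 3.0], [0.0, 4.0, 8.0]]

# Equal scores reduce to ordinary (unweighted) statistics pooling.
uniform = attentive_stats_pooling(features, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])

# Peaked scores let channel 0 attend to its last frame, channel 1 to its first.
peaked = attentive_stats_pooling(features, [[0.0, 0.0, 50.0], [50.0, 0.0, 0.0]])
```

With equal scores the result matches the plain mean and standard deviation of each channel, which is the statistics pooling of the x-vector baseline. With peaked scores, each channel's statistics are dominated by the frames that channel attends to, regardless of how long the utterance is, so the output length stays fixed at 2*C.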

Paper Structure

This paper contains 14 sections, 7 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The SE-Res2Block of the ECAPA-TDNN architecture. The standard Conv1D layers have a kernel size of 1. The central Res2Net Conv1D with scale dimension $s=8$ expands the temporal context through kernel size $k$ and dilation spacing $d$.
  • Figure 2: Network topology of the ECAPA-TDNN. We denote $k$ for kernel size and $d$ for dilation spacing of the Conv1D layers or SE-Res2Blocks. $C$ and $T$ correspond to the channel and temporal dimension of the intermediate feature-maps respectively. $S$ is the number of training speakers.
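
To make the Figure 1 caption concrete, the sketch below computes how many input frames each channel group of a Res2Net-style dilated Conv1D can see. It assumes the layout of the original Res2Net design: the channels are split into $s$ groups, the first group is passed through unchanged, and every later group is convolved after adding the previous group's output. The values $k=3$ and $d=2$ are illustrative only, and the function names are hypothetical.

```python
def dilated_context(kernel_size, dilation):
    """Frames covered by a single dilated 1-D convolution."""
    return (kernel_size - 1) * dilation + 1


def res2_group_contexts(scale, kernel_size, dilation):
    """Temporal context (in frames) of each of the `scale` channel groups.

    Assumed layout: group 0 is an identity mapping, and group i >= 1
    applies one convolution to the sum of its own input and the output
    of group i - 1. Group i has therefore passed through i stacked
    convolutions, each widening the context by (kernel_size - 1) * dilation.
    """
    growth = (kernel_size - 1) * dilation
    return [1 + i * growth for i in range(scale)]


single_layer = dilated_context(kernel_size=3, dilation=2)
contexts = res2_group_contexts(scale=8, kernel_size=3, dilation=2)
print(single_layer)  # 5
print(contexts)      # [1, 5, 9, 13, 17, 21, 25, 29]
```

Under these assumptions, one block mixes contexts from 1 up to 29 frames, whereas a single dilated convolution with the same $k$ and $d$ covers only 5. The kernel-size-1 Conv1D layers before and after the central convolution do not change these numbers, since they operate on one frame at a time.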