Speaker Characterization by means of Attention Pooling

Federico Costa, Miquel India, Javier Hernando

TL;DR

This work builds on Double Multi-Head Self-Attention (DMHSA) pooling, originally proposed for speaker verification (SV), and demonstrates its applicability to speaker characterization (SC) tasks such as speech emotion recognition (SER), sex classification, and COVID-19 detection. The DMHSA layer sits between a CNN front-end and a set of fully connected layers, and learns robust fixed-length speaker embeddings from variable-length inputs by applying multi-head attention along the temporal dimension and then across heads. In SV, DMHSA outperforms vanilla Self-Attention and Multi-Head Self-Attention (MHSA) pooling on VoxCeleb benchmarks, with the best configuration, which uses 32 heads, achieving notable equal error rate (EER) reductions. For SC tasks, DMHSA shows competitive SER performance, excellent sex-classification accuracy on Catalan CV data, and strong, imbalance-aware COVID-19 detection performance, illustrating the versatility of attention-based pooling across diverse speech-characterization problems.

Abstract

State-of-the-art Deep Learning systems for speaker verification are commonly based on speaker embedding extractors. These architectures are usually composed of a feature extractor front-end together with a pooling layer that encodes variable-length utterances into fixed-length speaker vectors. The authors have recently proposed the use of a Double Multi-Head Self-Attention pooling for speaker recognition, placed between a CNN-based front-end and a set of fully connected layers. This has been shown to be an excellent approach for efficiently selecting the most relevant features that the front-end captures from the speech signal. In this paper we show excellent experimental results obtained by adapting this architecture to other speaker characterization tasks, such as emotion recognition, sex classification and COVID-19 detection.
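
To make the pooling idea concrete, the following is a minimal sketch of how a double multi-head attention pooling layer can map a variable number of frames to one fixed-length vector. It is not the authors' implementation: the function names, the use of one query vector per head, the single query for the second attention, and the square-root scaling are all assumptions made for illustration, and the parameters are random rather than trained.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Return sum_i weights[i] * vectors[i] as a new vector."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def double_mhsa_pool(frames, head_queries, pooling_query):
    """Pool T frame vectors of size D into one vector of size D // K.

    frames: list of T vectors of length D (stand-in for front-end outputs).
    head_queries: K query vectors of length D // K (hypothetical parameters).
    pooling_query: one query vector of length D // K (hypothetical parameter).
    """
    num_heads = len(head_queries)
    head_dim = len(frames[0]) // num_heads
    scale = math.sqrt(head_dim)  # assumed scaling, as in scaled dot-product attention

    head_contexts = []
    for k, query in enumerate(head_queries):
        # Slice out the k-th sub-vector of every frame.
        sub = [f[k * head_dim:(k + 1) * head_dim] for f in frames]
        # First attention: one weight per frame, normalized over time.
        weights = softmax([dot(s, query) / scale for s in sub])
        head_contexts.append(weighted_sum(weights, sub))

    # Second attention: one weight per head, normalized over the K heads.
    head_weights = softmax([dot(c, pooling_query) / scale for c in head_contexts])
    return weighted_sum(head_weights, head_contexts)


def pool_random_utterance(num_frames, feat_dim=8, num_heads=4, seed=0):
    """Pool a random 'utterance' with random (untrained) attention parameters."""
    rng = random.Random(seed)
    head_dim = feat_dim // num_heads
    head_queries = [[rng.gauss(0, 1) for _ in range(head_dim)]
                    for _ in range(num_heads)]
    pooling_query = [rng.gauss(0, 1) for _ in range(head_dim)]
    frames = [[rng.gauss(0, 1) for _ in range(feat_dim)]
              for _ in range(num_frames)]
    return double_mhsa_pool(frames, head_queries, pooling_query)
```

Calling `pool_random_utterance(50)` and `pool_random_utterance(120)` returns vectors of the same length (`feat_dim // num_heads`), which is the property that lets the fully connected layers after the pooling operate on utterances of any duration.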

Paper Structure (20 sections, 4 equations, 1 figure, 4 tables)