Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings

Jakaria Islam Emon, Md Abu Salek, Kazi Tamanna Alam

TL;DR

This work tackles multilingual speaker identification by reusing a pre-trained multilingual Whisper encoder to produce robust speaker embeddings. The authors introduce WSI, a framework that jointly optimizes an online hard triplet loss with a self-supervised NT-Xent loss, using two augmented views to enforce representation consistency. By fine-tuning the Whisper encoder paired with a projection head, WSI achieves strong open-set speaker discriminability across languages and recording conditions, outperforming Pyannote Embedding, ECAPA-TDNN, and X-vector on VoxTube, JVS, CallHome, and VoxConverse datasets. The results validate that language-agnostic acoustic representations, when combined with targeted metric-learning objectives, yield superior cross-language speaker identification performance with practical implications for multilingual voice applications.
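
The joint objective described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the margin, the temperature, the loss weight, and the choice to apply the triplet term to the first view only are all assumptions made for the example. Embeddings are taken to be plain lists of floats, as if already produced by the encoder and projection head.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def batch_hard_triplet_loss(embeddings, labels, margin=1.0):
    """Online hard mining: for each anchor, use the farthest same-speaker
    sample and the closest different-speaker sample in the batch."""
    losses = []
    for i, anchor in enumerate(embeddings):
        pos = [euclidean(anchor, e) for j, e in enumerate(embeddings)
               if j != i and labels[j] == labels[i]]
        neg = [euclidean(anchor, e) for j, e in enumerate(embeddings)
               if labels[j] != labels[i]]
        if pos and neg:  # skip anchors without a valid triplet
            losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses) if losses else 0.0

def nt_xent_loss(view_a, view_b, temperature=0.5):
    """NT-Xent over two augmented views; item i of view_a is the positive
    for item i of view_b, and every other item in the batch is a negative."""
    z = [l2_normalize(v) for v in view_a + view_b]
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        partner = (i + n) % (2 * n)
        logits = [dot(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        log_denominator = math.log(sum(math.exp(l) for l in logits))
        total += log_denominator - dot(z[i], z[partner]) / temperature
    return total / (2 * n)

def joint_loss(view_a, view_b, labels, margin=1.0, temperature=0.5, weight=1.0):
    # Assumed combination: triplet term on the first view plus a weighted NT-Xent term.
    triplet = batch_hard_triplet_loss(view_a, labels, margin)
    return triplet + weight * nt_xent_loss(view_a, view_b, temperature)
```

In this sketch the triplet term uses speaker labels to separate identities, while the NT-Xent term needs no labels and only asks that two augmentations of the same utterance land close together.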

Abstract

Speaker identification in multilingual settings presents unique challenges, particularly when conventional models are predominantly trained on English data. In this paper, we propose WSI (Whisper Speaker Identification), a framework that repurposes the encoder of the Whisper automatic speech recognition model, pre-trained on extensive multilingual data, to generate robust speaker embeddings via a joint loss optimization strategy that leverages online hard triplet mining and a self-supervised Normalized Temperature-scaled Cross Entropy (NT-Xent) loss. By capitalizing on Whisper's language-agnostic acoustic representations, our approach effectively distinguishes speakers across diverse languages and recording conditions. Extensive evaluations on multiple corpora, including VoxTube (multilingual), JVS (Japanese), CallHome (German, Spanish, Chinese, and Japanese), and VoxConverse (English), demonstrate that WSI consistently outperforms state-of-the-art baselines, namely Pyannote Embedding, ECAPA-TDNN, and X-vector, achieving lower equal error rates and higher AUC scores. These results validate our hypothesis that a multilingual pre-trained ASR encoder, combined with joint loss optimization, substantially improves speaker identification performance in non-English languages.
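
For readers unfamiliar with the two metrics named above, the following sketch shows one common way to estimate the equal error rate (EER) and AUC from verification trial scores. It is an illustration only: the function names and the threshold sweep over observed scores are assumptions for this example, and the paper's actual evaluation code may differ. Higher scores are assumed to mean "more likely the same speaker".

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER: sweep thresholds over the observed scores and return
    the error rate where false acceptance and false rejection are closest."""
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False acceptance: impostor trials scoring at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # False rejection: genuine trials scoring below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

def auc(genuine_scores, impostor_scores):
    """Probability that a random genuine trial outscores a random impostor
    trial, counting ties as half."""
    wins = sum((g > i) + 0.5 * (g == i)
               for g in genuine_scores for i in impostor_scores)
    return wins / (len(genuine_scores) * len(impostor_scores))
```

A lower EER and a higher AUC both indicate that same-speaker and different-speaker trials are better separated by the embedding similarity.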

Paper Structure

This paper contains 11 sections, 10 equations, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: WSI Architecture
  • Figure 2: EER Across Methods and Datasets