Leveraging Speaker Embeddings in End-to-End Neural Diarization for Two-Speaker Scenarios

Juan Ignacio Alvarez-Trejos, Beltrán Labrador, Alicia Lozano-Diez

TL;DR

This work addresses two-speaker diarization by enriching the end-to-end EEND-EDA model with pre-trained ECAPA-TDNN speaker embeddings. It evaluates three integration strategies (feeding the embeddings into the EDA module, feeding them into the SA-EEND encoder, and concatenating them with the acoustic features), together with careful handling of silence frames using oracle and external VAD. The results show that embedding-feature concatenation, especially when adapted to CallHome data, yields substantial DER reductions relative to the baseline: the best setup reaches about $7.2\%$ DER with oracle VAD and $7.6\%$ with external VAD. The findings highlight the importance of careful silence treatment and embedding window sizing, and suggest that speaker embeddings can improve diarization without sacrificing the overlap-handling capabilities of end-to-end models.

Abstract

End-to-end neural speaker diarization systems can address the speaker diarization task while effectively handling speech overlap. This work explores the incorporation of speaker embeddings into end-to-end systems to enhance their speaker-discriminative capabilities while maintaining their overlap-handling strengths. To achieve this, we propose several methods for incorporating these embeddings alongside the acoustic features. Furthermore, we analyze the correct handling of silence frames, the window length used for extracting speaker embeddings, and the transformer encoder size. The proposed approach is evaluated on the CallHome dataset for the two-speaker diarization task, and the results demonstrate a significant reduction in diarization error rate, with a relative improvement of 10.78% over the baseline end-to-end model.
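
As a rough illustration of the idea of incorporating speaker embeddings alongside the acoustic features, the sketch below appends a window-level speaker embedding to every acoustic frame that falls inside that window. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `concat_embeddings`, the `frames_per_window` parameter, the rule of reusing the last embedding for trailing frames, and the toy dimensions are all hypothetical choices made for this example.

```python
def concat_embeddings(features, embeddings, frames_per_window):
    """Append a window-level speaker embedding to each acoustic frame.

    features:          list of T frames, each a list of F floats.
    embeddings:        list of W embeddings, each a list of E floats
                       (one embedding per analysis window; assumption).
    frames_per_window: number of consecutive frames covered by one
                       embedding (hypothetical parameter).

    Returns a list of T frames, each of length F + E.
    """
    if frames_per_window <= 0:
        raise ValueError("frames_per_window must be positive")
    if not embeddings:
        raise ValueError("at least one embedding is required")

    last_window = len(embeddings) - 1
    combined = []
    for t, frame in enumerate(features):
        # Map the frame index to its window; frames beyond the last
        # window reuse the final embedding (an assumption of this sketch).
        window = min(t // frames_per_window, last_window)
        combined.append(list(frame) + list(embeddings[window]))
    return combined


# Toy usage: 10 frames of 4-dimensional features, 3 embeddings of size 2.
toy_features = [[0.0] * 4 for _ in range(10)]
toy_embeddings = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
toy_output = concat_embeddings(toy_features, toy_embeddings, frames_per_window=4)
```

In this toy setup each output frame has 4 + 2 = 6 values, and the resulting sequence would be what the encoder receives in place of the plain acoustic features. Real embeddings and features would be much larger, and how silence frames are treated before concatenation is a separate design decision that this sketch does not cover.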

Paper Structure (18 sections, 11 equations, 2 figures, 6 tables)

Figures (2)

  • Figure 1: Architecture of the EEND-EDA model with the different proposed methods for integrating speaker embeddings into the system: a) speaker embeddings into the EDA module, b) speaker embeddings into the SA-EEND encoder, and c) concatenation of speaker embeddings and MFbank features into the SA-EEND encoder.
  • Figure 2: t-SNE visualization of embeddings for the conventional SA-EEND-EDA and our best approach: a) audio from simulated data, b) iafi recording, c) ialq recording.