DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors

Federico Landini, Mireia Diez, Themos Stafylakis, Lukáš Burget

TL;DR

DiaPer replaces the encoder-decoder attractor (EDA) module of end-to-end neural diarization with a Perceiver-based attractor decoder that handles a variable number of speakers non-autoregressively. A frame encoder produces frame embeddings, and a latent-based Perceiver decoder produces a fixed set of attractors; conditioning and auxiliary losses are added to stabilize training. Experiments on 8 kHz and 16 kHz data, including Callhome, DIHARD 3, and a broad set of wide-band corpora, show that DiaPer achieves competitive diarization error rates (DERs) with substantially fewer parameters than many baselines, and they show how design choices and training data affect cross-domain performance. Code and models are released to support reproducible research and further work on lightweight end-to-end diarization.

Abstract

Until recently, the field of speaker diarization was dominated by cascaded systems. Due to their limitations, mainly regarding overlapped speech and cumbersome pipelines, end-to-end models have recently gained great popularity. One of the most successful models is end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA). In this work, we replace the EDA module with a Perceiver-based one and show its advantages over EEND-EDA: better performance on the widely studied Callhome dataset, more accurate estimation of the number of speakers in a conversation, and faster inference. Furthermore, when exhaustively compared with other methods, our model, DiaPer, reaches remarkable performance with a very lightweight design. In addition, we perform comparisons with other works and a cascaded baseline across more than ten public wide-band datasets. Together with this publication, we release the code of DiaPer as well as models trained on public and free data.

Paper Structure (17 sections, 7 equations, 10 figures, 13 tables)

Figures (10)

  • Figure 1: DiaPer diagram. $\sigma$ denotes the sigmoid function, and the circles with crosses denote the dot product between vectors.
  • Figure 2: Scheme of frame encoder (middle), detail of self-attention layer (left) and conditioning scheme (right).
  • Figure 3: Scheme of Perceiver decoder.
  • Figure 4: Performance on CH1-2spk for different model dimensions (latents, frame embeddings and attractors).
  • Figure 5: DER (%) for telephone recordings of Callhome and DIHARD 3 conversational telephone speech (CTS) with 2 speakers.
  • ...and 5 more figures
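
The caption of Figure 1 mentions two operations: a dot product between vectors and a sigmoid. In attractor-based diarization these are typically combined by scoring each frame embedding against each attractor and passing the score through the sigmoid to get a per-frame, per-speaker activity probability. The following is a minimal sketch of that step only, not the authors' implementation. The function names, the toy 2-dimensional vectors, and the 0.5 decision threshold are all assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def speaker_activities(frame_embeddings: List[Vector],
                       attractors: List[Vector]) -> List[List[float]]:
    """Return a T x S matrix: sigmoid(dot(frame_t, attractor_s)).

    Hypothetical helper: T frames, S attractors, all of the same dimension.
    """
    return [
        [sigmoid(sum(e * a for e, a in zip(frame, attractor)))
         for attractor in attractors]
        for frame in frame_embeddings
    ]


def binarize(probs: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Turn probabilities into 0/1 decisions (the threshold is an assumption)."""
    return [[1 if p > threshold else 0 for p in row] for row in probs]


# Toy data (made up): 4 frames, 2 attractors, dimension 2.
frames = [
    [2.0, -2.0],   # matches attractor 0 only
    [-2.0, 2.0],   # matches attractor 1 only
    [2.0, 2.0],    # matches both (overlapped speech)
    [-2.0, -2.0],  # matches neither (silence)
]
attractors = [[1.0, 0.0], [0.0, 1.0]]

probs = speaker_activities(frames, attractors)
decisions = binarize(probs)
for t, row in enumerate(decisions):
    print(f"frame {t}: {row}")
```

Because each speaker gets an independent sigmoid rather than a softmax across speakers, a frame can be assigned to several speakers at once or to none, which is how this formulation represents overlapped speech and silence.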