DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors
Federico Landini, Mireia Diez, Themos Stafylakis, Lukáš Burget
TL;DR
DiaPer replaces the encoder-decoder attractor (EDA) module of end-to-end neural diarization with a Perceiver-based attractor decoder that handles a variable number of speakers non-autoregressively. A frame encoder produces frame embeddings, and a latent-based Perceiver decoder forms a fixed-size set of attractors from them; conditioning and auxiliary losses stabilize training. Experiments on 8 kHz and 16 kHz data, including Callhome, DIHARD3, and a broad set of wide-band corpora, show DiaPer reaching competitive DERs with substantially fewer parameters than many baselines, and highlight how design choices and training data affect cross-domain performance. The code and models are released to support reproducible research and further work on lightweight end-to-end diarization.
Abstract
Until recently, the field of speaker diarization was dominated by cascaded systems. Because of their limitations, mainly their handling of overlapped speech and their cumbersome pipelines, end-to-end models have gained great popularity. One of the most successful is end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA). In this work, we replace the EDA module with a Perceiver-based one and show its advantages over EEND-EDA: better performance on the widely studied Callhome dataset, more accurate estimation of the number of speakers in a conversation, and faster inference. Furthermore, when exhaustively compared with other methods, our model, DiaPer, reaches remarkable performance with a very lightweight design. We also compare against other works and a cascaded baseline on more than ten public wide-band datasets. Together with this publication, we release the code of DiaPer as well as models trained on public and free data.
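To make the idea of Perceiver-based attractors more concrete, the following is a minimal NumPy sketch, not the released DiaPer implementation. It assumes a single cross-attention step in which a fixed set of latent vectors queries the frame embeddings, and it assumes the EEND-style readout in which per-frame speaker activities are the sigmoid of the dot product between frame embeddings and attractors. The function names, dimensions, and random weights are hypothetical; the actual model uses learned parameters and further components (for example, the conditioning and auxiliary losses) that are omitted here.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def perceiver_attractors(frames, latents, w_q, w_k, w_v):
    """One cross-attention step (hypothetical helper).

    frames:  (T, D) frame embeddings from the frame encoder
    latents: (A, D) fixed set of latent vectors (learned in a real model)
    Returns (A, D) attractors; A does not depend on the number of frames T.
    """
    q = latents @ w_q                          # (A, D) queries from latents
    k = frames @ w_k                           # (T, D) keys from frames
    v = frames @ w_v                           # (T, D) values from frames
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (A, T) scaled dot products
    attn = softmax(scores, axis=-1)            # each latent attends over time
    return latents + attn @ v                  # residual update of the latents


def speaker_activities(frames, attractors):
    # Assumed EEND-style readout: sigmoid of frame-attractor dot products.
    logits = frames @ attractors.T             # (T, A)
    return 1.0 / (1.0 + np.exp(-logits))


# Random stand-ins for the encoder output and the model parameters.
rng = np.random.default_rng(0)
T, D, A = 200, 32, 4
frames = rng.standard_normal((T, D))
latents = rng.standard_normal((A, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

attractors = perceiver_attractors(frames, latents, w_q, w_k, w_v)
activity = speaker_activities(frames, attractors)
print(attractors.shape, activity.shape)
```

Because all attractors are computed in one pass over the frames, nothing is decoded one speaker at a time, which is the non-autoregressive property in this simplified setting.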
