EEND-M2F: Masked-attention mask transformers for speaker diarization
Marc Härkönen, Samuel J. Broughton, Lahiru Samarakoon
TL;DR
We reframe speaker diarization as a segmentation-like, end-to-end problem and propose EEND-M2F, a Mask2Former-inspired model. The architecture uses a Conformer backbone, a mask module that produces per-speaker masks, and a stack of transformer decoders with masked cross-attention, trained with a Hungarian set-prediction loss and deep supervision. The approach achieves state-of-the-art DER on multiple datasets (e.g., a DER of $16.07\%$ on DIHARD-III) without auxiliary diarization components or clustering, and runs efficiently with about $16.3$ million parameters. While it performs strongly on datasets with few speakers, the method shows room for improvement on recordings with very many speakers and on domain-specific channels, suggesting future work on non-parametric queries and prompt-based generalization.
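To illustrate what a set-prediction loss involves, the sketch below matches predicted speaker masks to reference speakers by minimizing a total pairwise cost. It is a minimal illustration under stated assumptions, not the model's training code: the cost here is plain binary cross-entropy between masks, the function names are hypothetical, and the search is brute force over assignments. A practical implementation would use the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`), which finds the same optimum efficiently.

```python
from itertools import permutations

import numpy as np


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a predicted and a reference mask."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean())


def match_queries_to_speakers(pred_masks, ref_masks):
    """Find the lowest-cost one-to-one assignment of queries to reference speakers.

    pred_masks: (N, T) speech-activity probabilities, one row per query.
    ref_masks:  (S, T) binary reference activity, one row per speaker, S <= N.
    Returns (assignment, cost), where assignment[s] is the query index
    matched to reference speaker s.

    Assumption: the pairwise cost is mask BCE only; the cost used by the
    actual model may combine several terms.
    """
    n, s = len(pred_masks), len(ref_masks)
    # cost[q, j] = cost of explaining reference speaker j with query q
    cost = np.array([[bce(p, r) for r in ref_masks] for p in pred_masks])

    best, best_cost = None, float("inf")
    # Brute force over ordered selections of S queries out of N (tiny inputs only).
    for perm in permutations(range(n), s):
        total = sum(cost[q, j] for j, q in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return list(best), best_cost
```

Queries left unmatched would typically be trained to predict "no speaker", which is how a fixed number of queries can handle a variable number of speakers.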
Abstract
In this paper, we make the explicit connection between image segmentation methods and end-to-end diarization methods. From these insights, we propose a novel, fully end-to-end diarization model, EEND-M2F, based on the Mask2Former architecture. Speaker representations are computed in parallel using a stack of transformer decoders, in which irrelevant frames are explicitly masked from the cross-attention using predictions from previous layers. EEND-M2F is lightweight, efficient, and truly end-to-end, as it does not require any additional diarization, speaker verification, or segmentation models to run, nor does it require running any clustering algorithms. Our model achieves state-of-the-art performance on several public datasets, such as AMI, AliMeeting and RAMC. Most notably, our DER of 16.07% on DIHARD-III is the first major improvement upon the challenge-winning system.
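The masking of irrelevant frames in the cross-attention can be sketched as follows. This is a single-head, projection-free illustration in NumPy under stated assumptions, not the model's implementation: the function name, the 0.5 threshold, and the fallback to full attention for a query whose predicted mask is empty are all assumptions made for the example.

```python
import numpy as np


def masked_cross_attention(queries, frames, prev_mask_logits, threshold=0.5):
    """Single-head cross-attention restricted by the previous layer's masks.

    queries:          (N, D) speaker queries.
    frames:           (T, D) frame embeddings from the backbone.
    prev_mask_logits: (N, T) mask logits predicted by the previous layer.
    Returns (updated, weights) with shapes (N, D) and (N, T).
    """
    d = queries.shape[-1]
    scores = queries @ frames.T / np.sqrt(d)  # (N, T) attention logits

    # Frames the previous layer assigned to each query (assumed 0.5 threshold).
    keep = 1.0 / (1.0 + np.exp(-prev_mask_logits)) >= threshold

    # Assumed safeguard: a query with an empty mask attends to all frames,
    # otherwise its softmax would be undefined.
    empty = ~keep.any(axis=1, keepdims=True)
    keep = keep | empty

    # Masked-out frames get -inf so they receive exactly zero weight.
    scores = np.where(keep, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    return weights @ frames, weights
```

Because each query only aggregates frames that the previous layer attributed to it, later layers refine a speaker representation from that speaker's own frames rather than from the whole recording.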
