EEND-M2F: Masked-attention mask transformers for speaker diarization

Marc Härkönen, Samuel J. Broughton, Lahiru Samarakoon

TL;DR

We reframe speaker diarization as a segmentation-like, end-to-end problem and propose EEND-M2F, a Mask2Former-inspired model. The architecture combines a Conformer backbone, a mask module that produces speaker masks, and a stack of transformer decoders with masked cross-attention, trained with a Hungarian-matching set-prediction loss and deep supervision. The approach achieves state-of-the-art DER on multiple datasets (e.g., $16.07\%$ on DIHARD-III) without auxiliary diarization components or clustering, and runs efficiently with about $16.3$ million parameters. While it performs strongly on datasets with few speakers, it leaves room for improvement on recordings with very high speaker counts and on domain-specific channels, suggesting future work on non-parametric queries and prompt-based generalization.

Abstract

In this paper, we make the explicit connection between image segmentation methods and end-to-end diarization methods. From these insights, we propose a novel, fully end-to-end diarization model, EEND-M2F, based on the Mask2Former architecture. Speaker representations are computed in parallel using a stack of transformer decoders, in which irrelevant frames are explicitly masked from the cross-attention using predictions from previous layers. EEND-M2F is lightweight, efficient, and truly end-to-end: it requires no additional diarization, speaker verification, or segmentation models, and no clustering algorithms. Our model achieves state-of-the-art performance on several public datasets, such as AMI, AliMeeting and RAMC. Most notably, our DER of 16.07% on DIHARD-III is the first major improvement upon the challenge-winning system.
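
To make the masked cross-attention idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single attention head, no learned projections, and a boolean `keep` matrix standing in for the thresholded mask predicted by the previous layer; the function name `masked_cross_attention` and the fallback to full attention when a query has no visible frame are illustrative assumptions.

```python
import math


def masked_cross_attention(queries, frames, keep):
    """Simplified masked cross-attention (assumed: one head, no projections).

    queries: N x d list of speaker query vectors
    frames:  T x d list of acoustic frame features
    keep:    N x T booleans, True where the query may attend to the frame
    """
    d = len(queries[0])
    outputs = []
    for q, keep_row in zip(queries, keep):
        # Assumption: if no frame is visible, fall back to attending everywhere.
        if not any(keep_row):
            keep_row = [True] * len(frames)
        # Scaled dot-product scores; hidden frames get -inf so softmax gives 0.
        scores = [
            sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(d) if visible else -math.inf
            for f, visible in zip(frames, keep_row)
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of frame features.
        outputs.append(
            [sum(w * f[j] for w, f in zip(weights, frames)) for j in range(d)]
        )
    return outputs


# Hypothetical toy input: one query, three frames, only the first frame visible.
refined = masked_cross_attention(
    [[1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
    [[True, False, False]],
)
print(refined)  # [[1.0, 0.0]]
```

With only the first frame visible, the query simply copies that frame's features; the third frame, despite its large dot product, contributes nothing because it is masked out.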

Paper Structure (27 sections, 1 theorem, 15 equations, 2 figures, 7 tables)

This paper contains 27 sections, 1 theorem, 15 equations, 2 figures, 7 tables.

Key Result

Proposition 1.1

The algorithm in code:DER computes the optimal diarization error rate.

Figures (2)

  • Figure 1: Overview of EEND-M2F. An encoder backbone, depicted in green, processes the input audio into low- and high-resolution acoustic features. A set of learnable queries is iteratively refined using transformer decoders. A mask module (MM, in red) generates speaker-wise masks from queries and acoustic features. These masks correspond to diarization predictions, and they are also used to mask the acoustic features in the transformer decoder. Finally, an MLP determines which queries correspond to actual speakers.
  • Figure 2: Example of the Hungarian matching procedure with $N=5$ queries (1,2,3,4,5) and $S=3$ speakers (A,B,C). A matching is an assignment of each of the $S$ speakers to a unique query. The cost matrix consists of pairwise matching costs, and the optimal matching $\phi^*$ is the one minimizing the sum of $S$ costs, where at most one is chosen from each row and each column.
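
The matching described in Figure 2 can be illustrated with a small sketch. The cost values below are made up for illustration, and the search is a brute-force enumeration rather than the Hungarian algorithm itself; both return the same optimal matching $\phi^*$, but the enumeration is only practical for small $N$ and $S$. The function name `optimal_matching` is hypothetical.

```python
from itertools import permutations


def optimal_matching(cost):
    """Assign each speaker to a unique query with minimal total cost.

    cost[s][q] is the cost of matching speaker s to query q (S rows, N columns,
    S <= N). Returns (phi, total), where phi[s] is the query index for speaker s.
    """
    n_speakers, n_queries = len(cost), len(cost[0])

    def total_cost(phi):
        return sum(cost[s][phi[s]] for s in range(n_speakers))

    # Enumerate every ordered choice of S distinct queries out of N.
    best = min(permutations(range(n_queries), n_speakers), key=total_cost)
    return list(best), total_cost(best)


# Hypothetical costs for S=3 speakers (A, B, C) and N=5 queries (1..5).
cost = [
    [9, 2, 8, 7, 6],  # speaker A
    [1, 3, 9, 8, 7],  # speaker B
    [8, 7, 6, 1, 9],  # speaker C
]
phi, total = optimal_matching(cost)
print(phi, total)  # [1, 0, 3] 4
```

Here speaker A is matched to query 2, B to query 1, and C to query 4 (zero-based indices 1, 0, 3), leaving queries 3 and 5 unmatched. For larger problems, `scipy.optimize.linear_sum_assignment` accepts rectangular cost matrices and solves the same assignment problem in polynomial time.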

Theorems & Definitions (2)

  • Proposition 1.1
  • proof