Mamba-based Segmentation Model for Speaker Diarization

Alexis Plaquet, Naohiro Tawara, Marc Delcroix, Shota Horiguchi, Atsushi Ando, Shoko Araki

TL;DR

The proposed Mamba-based system achieves state-of-the-art performance on three widely used diarization datasets and is found to be a superior alternative to both traditional RNN and the tested attention-based model.

Abstract

Mamba is a recently proposed architecture that behaves like a recurrent neural network (RNN) with attention-like capabilities. These properties are promising for speaker diarization: the memory requirements of attention-based models make them unsuitable for long-form audio, while the capabilities of traditional RNNs are too limited. In this paper, we assess the potential of Mamba for diarization by comparing the state-of-the-art neural segmentation of the pyannote pipeline with our proposed Mamba-based variant. Mamba's stronger processing capabilities allow the use of longer local windows, which significantly improves diarization quality by making speaker embedding extraction more reliable. We find Mamba to be a superior alternative to both the traditional RNN and the tested attention-based model. Our proposed Mamba-based system achieves state-of-the-art performance on three widely used diarization datasets.

Paper Structure (19 sections, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Local EEND architecture. The output activation is sigmoid for the multilabel output and softmax for the multiclass output.
  • Figure 2: Oracle clustering DER as a function of window size for each architecture. Sliding windows do not overlap.
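
The two captions above mention non-overlapping sliding windows and two output activations. The following is a minimal sketch of both ideas, not the authors' implementation: the function names, the logit values, and the window length are hypothetical and chosen only for illustration. It assumes the multilabel output carries one logit per speaker, scored independently, and the multiclass output carries one logit per class, normalized into a single distribution.

```python
import math


def split_non_overlapping(num_samples, window):
    """Return (start, end) sample indices of consecutive, non-overlapping windows.

    The last window is shorter when num_samples is not a multiple of window.
    """
    return [(s, min(s + window, num_samples)) for s in range(0, num_samples, window)]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [v / total for v in exps]


def multilabel_output(logits):
    """Sigmoid per logit: each value is an independent probability."""
    return [sigmoid(z) for z in logits]


def multiclass_output(logits):
    """Softmax over all logits: the values form one probability distribution."""
    return softmax(logits)


if __name__ == "__main__":
    # Hypothetical frame-level logits, for illustration only.
    frame_logits = [2.0, -1.0, 0.5]
    print(multilabel_output(frame_logits))  # values need not sum to 1
    print(multiclass_output(frame_logits))  # values sum to 1
    print(split_non_overlapping(10, 4))
```

With a sigmoid, several outputs can be high at the same frame, whereas a softmax forces the outputs to compete for a total probability of 1.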