Do End-to-End Neural Diarization Attractors Need to Encode Speaker Characteristic Information?

Lin Zhang; Themos Stafylakis; Federico Landini; Mireia Diez; Anna Silnova; Lukáš Burget

TL;DR

The paper uses the variational information bottleneck (VIB) to probe what information must be encoded in the frame embeddings and attractors of end-to-end neural diarization with encoder-decoder attractors (EEND-EDA). It shows that attractors do not need to encode speaker identities: strong regularization that preserves only coarse information still yields competitive diarization performance, and small gains are possible when attractors retain some speaker-specific content. The findings suggest that attractors act as anchors for counting speakers rather than as carriers of individual identities, with broader implications for privacy-aware and resource-efficient diarization. Because attractors and frame embeddings are common to most EEND variants, the approach and results are expected to carry over to them, offering practical guidance for design choices and potential privacy-preserving extensions.

Abstract

In this paper, we apply the variational information bottleneck approach to end-to-end neural diarization with encoder-decoder attractors (EEND-EDA). This allows us to investigate what information is essential for the model. EEND-EDA utilizes attractors, which are vector representations of the speakers in a conversation. Our analysis shows that attractors do not necessarily have to contain speaker characteristic information. On the other hand, giving the attractors more freedom to encode some extra (possibly speaker-specific) information leads to small but consistent diarization performance improvements. Despite architectural differences among EEND systems, the notion of attractors and frame embeddings is common to most of them and not specific to EEND-EDA. We therefore believe that the main conclusions of this work can apply to other variants of EEND, and we hope this paper will help the community make more informed decisions when designing new systems.
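
As a rough illustration of the kind of regularizer the abstract refers to, the following is a minimal NumPy sketch of a VIB-style loss. It is not the authors' implementation. It assumes diagonal Gaussian posteriors over frame embeddings and attractors, a standard normal prior, and KLD terms averaged over their leading axes. The function names (`sample_gaussian`, `kl_to_standard_normal`, `vib_loss`) are hypothetical, and `beta_e` and `beta_a` stand in for the KLD weights $\beta_e$ and $\beta_a$ that appear in the figure captions.

```python
import numpy as np


def sample_gaussian(mu, log_var, num_samples, rng):
    """Draw samples with the reparameterization trick: z = mu + sigma * eps."""
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal((num_samples,) + mu.shape)
    return mu + sigma * eps


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over the last axis."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)


def vib_loss(task_loss, mu_e, log_var_e, mu_a, log_var_a, beta_e, beta_a):
    """Task loss plus weighted KLD terms (assumed form, not taken from the paper).

    mu_e, log_var_e: posterior parameters of frame embeddings, shape [T, D].
    mu_a, log_var_a: posterior parameters of attractors, shape [S, D].
    """
    kl_e = kl_to_standard_normal(mu_e, log_var_e).mean()
    kl_a = kl_to_standard_normal(mu_a, log_var_a).mean()
    return task_loss + beta_e * kl_e + beta_a * kl_a
```

Under these assumptions, larger values of `beta_e` or `beta_a` push the corresponding posteriors toward the prior, so less information survives in the frame embeddings or attractors; setting a weight to zero leaves that representation unregularized.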

Paper Structure (22 sections, 15 equations, 10 figures, 2 tables)

Introduction
EEND-EDA with VIB
EEND with Encoder-Decoder-Based Attractors
Variational Information Bottleneck
VIB model and objective
VIB training
VIB regularized classifier
EEND-EDA regularized using VIB
Experimental Setup
Data
Configurations
Results and Discussions
Impact of the weight of the KLD loss
Number of samples and selection of the permutation $\phi$
Visualization of attractors and frame embeddings
...and 7 more sections

Figures (10)

Figure 1: Comparison between the original EEND-EDA model and EEND-EDA with VIB. Numbers in brackets [] denote dimensions. M is the number of samples, B is the batch size, T is the number of frames, D is the feature dimension, and S is the maximum number of speakers within one mini-batch. In (a), the four additional FC layers in the dashed box are introduced to make the original EEND-EDA comparable with (b).
Figure 2: DER (%) in CH1-2spk for different values of $\beta_e$ and $\beta_a$. Five runs of EEND-EDA + 4FC are shown as baselines within the gray shaded band, and the dashed line represents the mean of those five runs (the runs overlap due to the small variance).
Figure 3: Visualization of attractors and frame embeddings as Gaussian distributions, after projecting them to two dimensions using PCA (DERs are in %). In subfigure (b), colors represent different audio files; line styles denote individual speakers, overlap, and silence. Note that in subfigures (0) the attractors and frame embeddings are deterministic and are therefore represented by dots.
Figure 4: Decomposed visualization of Fig. 3.4 ($\beta_a=0, \beta_e=10^{1}$). (i) two single speakers, and (ii) overlap and silence.
Figure 5: Visualization of attractors (left) and frame embeddings (right) when $\beta_a = \beta_e=10^{1}$.
...and 5 more figures
