Do End-to-End Neural Diarization Attractors Need to Encode Speaker Characteristic Information?

Lin Zhang, Themos Stafylakis, Federico Landini, Mireia Diez, Anna Silnova, Lukáš Burget

TL;DR

The paper uses the variational information bottleneck (VIB) to probe what information must be encoded in the frame embeddings and attractors of end-to-end neural diarization with encoder-decoder attractors (EEND-EDA). It shows that attractors do not need to encode speaker identities: strong regularization that preserves only coarse information still yields competitive diarization performance, with small gains possible when attractors retain some speaker-specific content. The findings suggest that attractors act more as anchors for counting speakers than as carriers of individual identities, which has implications for privacy-aware and resource-efficient diarization. Because attractors and frame embeddings are common to most EEND systems, the conclusions are expected to carry over to other EEND variants and to offer practical guidance for design choices.

Abstract

In this paper, we apply the variational information bottleneck approach to end-to-end neural diarization with encoder-decoder attractors (EEND-EDA). This allows us to investigate what information is essential for the model. EEND-EDA utilizes attractors, which are vector representations of the speakers in a conversation. Our analysis shows that attractors do not necessarily have to contain speaker characteristic information. On the other hand, giving the attractors more freedom to encode some extra (possibly speaker-specific) information leads to small but consistent diarization performance improvements. Despite architectural differences in EEND systems, the notion of attractors and frame embeddings is common to most of them and not specific to EEND-EDA. We believe that the main conclusions of this work can apply to other variants of EEND. Thus, we hope this paper will be a valuable contribution that guides the community toward more informed decisions when designing new systems.
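
To make the role of the information bottleneck more concrete, the following is a minimal sketch of how a VIB-style regularizer could be attached to frame embeddings and attractors. It is not the authors' implementation: the diagonal Gaussian parameterization, the standard normal prior, the averaging of the KL terms, and all function names are assumptions made for illustration. Only the weights `beta_e` and `beta_a` (for frame embeddings and attractors) and the idea of drawing several samples mirror quantities mentioned in the figure captions.

```python
import numpy as np


def sample_gaussian(mean, log_var, num_samples, rng):
    """Draw reparameterized samples z = mean + std * eps (hypothetical helper).

    Returns an array of shape [num_samples, *mean.shape].
    """
    std = np.exp(0.5 * log_var)
    eps = rng.standard_normal((num_samples,) + mean.shape)
    return mean + std * eps


def kl_to_standard_normal(mean, log_var):
    """KL(N(mean, diag(exp(log_var))) || N(0, I)), summed over the last axis.

    Assumption: a standard normal prior, the usual choice in VIB setups.
    """
    return 0.5 * np.sum(np.exp(log_var) + mean**2 - 1.0 - log_var, axis=-1)


def vib_loss(task_loss, emb_mean, emb_log_var, att_mean, att_log_var,
             beta_e, beta_a):
    """Task loss plus weighted KL penalties on embeddings and attractors.

    Assumption: KL terms are averaged over frames / speakers before weighting.
    """
    kl_e = kl_to_standard_normal(emb_mean, emb_log_var).mean()
    kl_a = kl_to_standard_normal(att_mean, att_log_var).mean()
    return task_loss + beta_e * kl_e + beta_a * kl_a
```

In this sketch, a larger `beta_a` pushes the attractor distributions toward the prior, so less information can pass through them; setting a weight to zero leaves the corresponding representation unregularized.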

Paper Structure

This paper contains 22 sections, 15 equations, 10 figures, and 2 tables.

Figures (10)

  • Figure 1: Comparison between the original EEND-EDA model and EEND-EDA with VIB. Numbers within [] represent dimensions. M is the number of samples, B is the batch size, T is the number of frames, D is the feature dimension, and S is the maximum number of speakers within one mini-batch. In (a), four additional FC layers (in the dashed box) are introduced to make the original EEND-EDA comparable with (b).
  • Figure 2: DER (%) on CH1-2spk for different values of $\beta_e$ and $\beta_a$. Five runs of EEND-EDA + 4FC are shown as baselines within the gray shaded region, and the dashed line represents the mean of those five runs (the runs overlap due to the small variance).
  • Figure 3: Visualization of attractors and frame embeddings as Gaussian distributions, after projecting them to two dimensions using PCA (DERs are in %). In subfigure (b), colors represent different audio files; line styles denote individual speakers, overlap, and silence. Note that in subfigures (0) the attractors and frame embeddings are deterministic and are therefore represented by dots.
  • Figure 4: Decomposed visualization of Fig. 3(b).4 ($\beta_a=0, \beta_e=10^{1}$). (i) two single speakers, and (ii) overlap and silence.
  • Figure 5: Visualization of attractors (left) and frame embeddings (right) when $\beta_a = \beta_e=10^{1}$.
  • ...and 5 more figures
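
The captions of Figures 3 to 5 describe projecting Gaussian-distributed attractors and frame embeddings to two dimensions with PCA. A minimal sketch of one way to do this is given below. It assumes diagonal covariances and fits the PCA on the Gaussian means only; the function name and these choices are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def pca_project_gaussians(means, variances, n_components=2):
    """Project diagonal Gaussians to a low-dimensional PCA subspace.

    means:     [N, D] array of Gaussian means
    variances: [N, D] array of per-dimension variances (diagonal covariance)
    Returns projected means [N, k] and projected covariances [N, k, k].
    Assumption: PCA directions are estimated from the means alone.
    """
    centered = means - means.mean(axis=0)
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    w = vt[:n_components]  # [k, D]
    proj_means = centered @ w.T
    # For each Gaussian: W diag(variances) W^T.
    proj_covs = np.einsum("kd,nd,ld->nkl", w, variances, w)
    return proj_means, proj_covs
```

Deterministic points (zero variance) project to zero covariance, so they would appear as dots rather than ellipses, consistent with how the Figure 3 caption describes the deterministic case.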