Do End-to-End Neural Diarization Attractors Need to Encode Speaker Characteristic Information?
Lin Zhang, Themos Stafylakis, Federico Landini, Mireia Diez, Anna Silnova, Lukáš Burget
TL;DR
The paper uses the variational information bottleneck (VIB) to probe what information must be encoded in the frame embeddings and attractors of end-to-end neural diarization with encoder-decoder attractors (EEND-EDA). It shows that attractors do not need to encode speaker identities: strong regularization that preserves only coarse information still yields competitive diarization performance, with small gains possible when attractors retain some speaker-specific content. The findings suggest that attractors act as anchors for counting speakers rather than as carriers of individual identities, with broader implications for privacy-aware and resource-efficient diarization. The approach and results are expected to carry over to other EEND variants, offering practical guidance for design choices and potential privacy-preserving extensions.
Abstract
In this paper, we apply the variational information bottleneck approach to end-to-end neural diarization with encoder-decoder attractors (EEND-EDA). This allows us to investigate what information is essential for the model. EEND-EDA utilizes attractors, which are vector representations of the speakers in a conversation. Our analysis shows that attractors do not necessarily have to contain speaker characteristic information. On the other hand, giving the attractors more freedom to encode some extra (possibly speaker-specific) information leads to small but consistent diarization performance improvements. Despite architectural differences between EEND systems, the notion of attractors and frame embeddings is common to most of them and not specific to EEND-EDA. We therefore believe that the main conclusions of this work apply to other variants of EEND, and we hope this paper will help the community make more informed decisions when designing new systems.
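For readers unfamiliar with the variational information bottleneck, the following is a minimal sketch of the general technique, not the authors' implementation. It assumes each embedding or attractor is replaced by a diagonal Gaussian with a standard normal prior, and that a weight `beta` trades the task loss against the KL term. These choices, and the function names `sample_latent`, `kl_to_standard_normal`, and `vib_objective`, are illustrative assumptions; the abstract does not specify the exact formulation.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, 1) per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a single vector."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def vib_objective(task_loss, kl_terms, beta):
    """Task loss plus beta times the mean KL over the regularized vectors.

    A larger beta (an assumed hyperparameter) squeezes more information
    out of the representation.
    """
    return task_loss + beta * sum(kl_terms) / len(kl_terms)


# Hypothetical 4-dimensional "attractor" posterior parameters.
rng = random.Random(0)
mu = [0.3, -0.1, 0.0, 0.8]
log_var = [-1.0, -0.5, 0.0, -2.0]
z = sample_latent(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
loss = vib_objective(task_loss=0.7, kl_terms=[kl], beta=0.01)
```

In this sketch, a posterior that matches the prior (zero mean, unit variance) has zero KL and therefore carries no information about the input. That is the sense in which strong regularization leaves only coarse information in the attractors.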
