Speaker Embedding Informed Audiovisual Active Speaker Detection for Egocentric Recordings
Jason Clarke, Yoshihiko Gotoh, Stefan Goetze
TL;DR
The paper tackles audiovisual active speaker detection (ASD) in egocentric recordings, where visual cues are often unreliable because of occlusions and motion. It introduces SCAN, a cross-attention module that compares speaker embeddings from reference speech with the candidate audio to identify the frames in which the candidate is speaking, and pairs it with a self-supervised method, based on video-based face recognition (VBFR), for enrolling robust face-speaker libraries. Integrated with existing baselines such as TalkNet and Light-ASD, the two contributions yield substantial improvements on the Ego4D dataset and smaller but meaningful gains on AVA-ActiveSpeaker, narrowing the gap to state-of-the-art performance in challenging wearable-camera scenarios. Overall, the approach improves robustness and speaker attribution in egocentric ASD, with practical implications for diarisation and real-time speaker tracking in mobile and wearable settings.
Abstract
Audiovisual active speaker detection (ASD) addresses the task of determining the speech activity of a candidate speaker given acoustic and visual data. Typically, systems model the temporal correspondence of audiovisual cues, such as the synchronisation between speech and lip movement. Recent work has explored extending this paradigm by additionally leveraging speaker embeddings extracted from the candidate speaker's reference speech. This paper proposes the speaker comparison auxiliary network (SCAN), which uses speaker-specific information from both reference speech and the candidate audio signal to disambiguate challenging scenes when the visual signal is unresolvable. Furthermore, an improved method for enrolling face-speaker libraries is developed, which implements a self-supervised approach to video-based face recognition. In keeping with the recent proliferation of wearable devices, this work focuses on improving speaker-embedding-informed ASD in the context of egocentric recordings, which can be characterised by acoustic noise and highly dynamic scenes. SCAN is implemented with two well-established baselines, namely TalkNet and Light-ASD, yielding relative improvements in mAP of 14.5% and 10.3% on the Ego4D benchmark, respectively.
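To make the idea of comparing reference-speech embeddings with the candidate audio more concrete, the following is a minimal sketch of generic scaled dot-product cross-attention in plain Python. It is not the SCAN architecture, whose details are not given here. It assumes that per-frame embeddings of the candidate audio act as queries and that embeddings of enrolled reference utterances act as keys and values, with no learned projections. The function names (cross_attend, speaker_match_scores) and the toy 3-dimensional vectors are hypothetical and chosen only for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom > 0 else 0.0


def cross_attend(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    Each query attends over all keys and returns the attention-weighted
    average of the corresponding values.
    """
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        attended = [sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(dim)]
        outputs.append(attended)
    return outputs


def speaker_match_scores(frame_embeddings, reference_embeddings):
    """Hypothetical per-frame score: cosine similarity between each
    candidate-audio frame embedding and its attended reference summary."""
    attended = cross_attend(frame_embeddings,
                            reference_embeddings,
                            reference_embeddings)
    return [cosine(f, a) for f, a in zip(frame_embeddings, attended)]


# Toy data (assumed): two reference utterances from the enrolled speaker,
# one frame that resembles them and one that is orthogonal to them.
reference = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
frames = [[1.0, 0.05, 0.0], [0.0, 0.0, 1.0]]
scores = speaker_match_scores(frames, reference)
print(scores)
```

In this toy setup the first frame scores close to 1 and the second scores 0, which mirrors the intuition that audio matching the enrolled speaker should be easier to attribute to the candidate when the visual signal is uninformative. A trained system would learn the projections and fuse such a signal with the audiovisual features rather than threshold a raw cosine score.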
