Speaker Embedding Informed Audiovisual Active Speaker Detection for Egocentric Recordings

Jason Clarke, Yoshihiko Gotoh, Stefan Goetze

TL;DR

The paper tackles audiovisual active speaker detection (ASD) in egocentric recordings, where visual cues are often unreliable due to occlusions and motion. It introduces SCAN, a cross-attention module that compares speaker embeddings from reference speech and the candidate audio to disambiguate frames in which the candidate is speaking, and complements this with a self-supervised method based on video-based face recognition (VBFR) for enrolling robust face-speaker libraries. Together, these two contributions yield substantial improvements on the Ego4D dataset (and smaller yet meaningful gains on AVA-ActiveSpeaker) when integrated with existing baselines such as TalkNet and Light-ASD, narrowing the gap to state-of-the-art performance in challenging wearable-camera scenarios. Overall, the approach improves robustness and speaker attribution in egocentric ASD, with practical implications for diarisation and real-time speaker tracking in mobile and wearable settings.

Abstract

Audiovisual active speaker detection (ASD) addresses the task of determining the speech activity of a candidate speaker given acoustic and visual data. Typically, systems model the temporal correspondence of audiovisual cues, such as the synchronisation between speech and lip movement. Recent work has explored extending this paradigm by additionally leveraging speaker embeddings extracted from candidate speaker reference speech. This paper proposes the speaker comparison auxiliary network (SCAN), which uses speaker-specific information from both reference speech and the candidate audio signal to disambiguate challenging scenes when the visual signal is unresolvable. Furthermore, an improved method for enrolling face-speaker libraries is developed, which implements a self-supervised approach to video-based face recognition. Fitting with the recent proliferation of wearable devices, this work focuses on improving speaker-embedding-informed ASD in the context of egocentric recordings, which can be characterised by acoustic noise and highly dynamic scenes. SCAN is implemented with two well-established baselines, namely TalkNet and Light-ASD, yielding relative improvements in mAP of 14.5% and 10.3% on the Ego4D benchmark, respectively.
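To make the idea of a framewise comparison between reference speech and the input audio more concrete, the following is a minimal sketch of scaled dot-product cross-attention in plain Python. It is not the SCAN implementation: the function names (`cross_attend`, `softmax`, `dot`) are hypothetical, the feature vectors are toy values, and the learned projection layers that a real attention module would normally include are omitted. Each audio frame acts as a query and attends over the reference-speech frames, which serve as both keys and values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(audio_frames, reference_frames):
    """Hypothetical helper: each audio frame (query) attends over the
    reference-speech frames (keys and values, with no learned projections).

    Returns one attended vector per audio frame and the attention weights.
    """
    d = len(audio_frames[0])
    outputs, weights = [], []
    for q in audio_frames:
        # Similarity of this audio frame to every reference frame.
        scores = [dot(q, k) / math.sqrt(d) for k in reference_frames]
        w = softmax(scores)
        # Weighted sum of the reference frames.
        attended = [
            sum(w_j * v[i] for w_j, v in zip(w, reference_frames))
            for i in range(d)
        ]
        outputs.append(attended)
        weights.append(w)
    return outputs, weights


# Toy example: 2 audio frames and 3 reference frames, all 2-dimensional.
audio = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = cross_attend(audio, reference)
```

In this sketch, an audio frame receives larger attention weights for reference frames it resembles, which is the intuition behind using reference speech to help decide whether the candidate is the one speaking.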

Paper Structure

This paper contains 15 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Example of a typical false-positive ASD: a) input audio signal; b) ground-truth speaker activity of the candidate speaker (inactive throughout) and speaker activity hypothesised by a state-of-the-art speaker-embedding-naive ASD system [Liao_2023_CVPR]; c) selection of challenging video frames from a typical egocentric video track [Ego4D].
  • Figure 2: SCAN, shown in the top box, leverages speaker-specific information for framewise comparison of the reference speech and the input audio signal via cross-attention. The bottom box shows a typical ASD architectural design (baseline). Dotted connections represent non-end-to-end passages in the framework.
  • Figure 3: Self-supervised video-based face recognition model. Impostor frames are randomly inserted into the parent track, resulting in the polluted track $\mathcal{V}_{\mathrm{S}}'$. The training objective is for the model to classify frames as either native or impostor frames with respect to the parent track. $\diameter$ denotes averaging.
  • Figure 4: Similarity between same-identity embeddings and different-identity embeddings, shown in green and red, respectively, for the Ego4D validation fold [Ego4D].
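The polluted-track construction described in the Figure 3 caption can be illustrated with a short sketch. This is an assumption-laden toy version, not the authors' pipeline: the function name `pollute_track`, the parameters `num_impostors` and `seed`, and the representation of frames as strings are all hypothetical. Impostor frames are inserted at random positions into a parent track, and each frame in the result is labelled as native (0) or impostor (1), which is the kind of target a native-versus-impostor classifier would be trained on.

```python
import random


def pollute_track(parent_track, impostor_pool, num_impostors, seed=None):
    """Hypothetical helper: insert impostor frames at random positions.

    Returns the polluted track and per-frame labels
    (0 = native to the parent track, 1 = impostor).
    """
    rng = random.Random(seed)
    # Start with every parent frame labelled as native.
    frames = [(frame, 0) for frame in parent_track]
    for _ in range(num_impostors):
        # randint is inclusive, so an impostor may also be appended at the end.
        position = rng.randint(0, len(frames))
        frames.insert(position, (rng.choice(impostor_pool), 1))
    polluted = [frame for frame, _ in frames]
    labels = [label for _, label in frames]
    return polluted, labels


# Toy example: frames are stand-in strings rather than face crops.
parent = ["a0", "a1", "a2", "a3"]
impostors = ["b0", "b1", "c0"]
polluted, labels = pollute_track(parent, impostors, num_impostors=2, seed=0)
```

Because the labels come for free from the insertion step, no manual identity annotation is needed in this sketch, which is what makes the setup self-supervised. The relative order of the native frames is preserved.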