LoCoNet: Long-Short Context Network for Active Speaker Detection

Xizi Wang, Feng Cheng, Gedas Bertasius, David Crandall

TL;DR

LoCoNet addresses Active Speaker Detection by jointly modeling long-term intra-speaker context and short-term inter-speaker context. It combines an audio-visual encoder with a Long-Short Context Modeling (LSCM) module that interleaves LIM (self- and cross-attention for a single speaker over time) and SIM (inter-speaker convolution over nearby frames and speakers). The approach achieves state-of-the-art results across AVA-ActiveSpeaker, Talkies, and Ego4D, notably excelling in challenging multi-speaker and small-face scenarios, while maintaining efficient parallel inference. The work demonstrates that explicit long-range intra-speaker and local inter-speaker cues provide strong, complementary signals for ASD, and releases code for replication and extension.

Abstract

Active Speaker Detection (ASD) aims to identify who is speaking in each frame of a video. ASD reasons from audio and visual information from two contexts: long-term intra-speaker context and short-term inter-speaker context. Long-term intra-speaker context models the temporal dependencies of the same speaker, while short-term inter-speaker context models the interactions of speakers in the same scene. These two contexts are complementary and can help infer the active speaker. Motivated by these observations, we propose LoCoNet, a simple yet effective Long-Short Context Network that models both the long-term intra-speaker context and the short-term inter-speaker context. We use self-attention to model long-term intra-speaker context because of its effectiveness in modeling long-range dependencies, and convolutional blocks, which capture local patterns, to model short-term inter-speaker context. Extensive experiments show that LoCoNet achieves state-of-the-art performance on multiple datasets, with an mAP of 95.2% (+1.1%) on AVA-ActiveSpeaker, 68.1% (+22%) on the Columbia dataset, 97.2% (+2.8%) on the Talkies dataset, and 59.7% (+8.0%) on the Ego4D dataset. Moreover, in challenging cases where multiple speakers are present, or where the face of the active speaker is much smaller than other faces in the same scene, LoCoNet outperforms previous state-of-the-art methods by 3.4% on the AVA-ActiveSpeaker dataset. The code will be released at https://github.com/SJTUwxz/LoCoNet_ASD.
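The two context types described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names are hypothetical, the attention uses identity projections instead of learned query/key/value weights, a fixed mean filter stands in for a learned convolution, and the residual connections and the order of the two steps inside a block are assumptions made for illustration. Features are assumed to have shape (S, T, C) for S speakers, T frames, and C channels.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def long_term_intra_speaker(feats):
    """Self-attention over all T frames, applied to each speaker separately.

    feats: float array of shape (S, T, C). Identity Q/K/V projections
    are an assumption made to keep the sketch short.
    """
    channels = feats.shape[-1]
    scores = feats @ feats.transpose(0, 2, 1) / np.sqrt(channels)  # (S, T, T)
    weights = softmax(scores, axis=-1)  # each frame attends to every frame
    return weights @ feats  # (S, T, C)

def short_term_inter_speaker(feats, k_speakers=3, k_frames=3):
    """Local mixing across neighbouring speakers and frames.

    A zero-padded mean filter over a (k_speakers x k_frames) window stands
    in for a learned convolution. Kernel sizes are assumed to be odd.
    """
    num_speakers, num_frames, _ = feats.shape
    ps, pt = k_speakers // 2, k_frames // 2
    padded = np.pad(feats, ((ps, ps), (pt, pt), (0, 0)))
    out = np.zeros_like(feats)
    for ds in range(k_speakers):
        for dt in range(k_frames):
            out += padded[ds:ds + num_speakers, dt:dt + num_frames]
    return out / (k_speakers * k_frames)

def lscm_block(feats):
    # Hypothetical block: short-term step, then long-term step, both residual.
    x = feats + short_term_inter_speaker(feats)
    return x + long_term_intra_speaker(x)

feats = np.random.default_rng(0).normal(size=(3, 8, 4))  # S=3, T=8, C=4
print(lscm_block(feats).shape)  # (3, 8, 4)
```

The long-term step never mixes information across speakers, and the short-term step only sees a small window of frames, which is why the two are described as complementary.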

Paper Structure (19 sections, 3 equations, 6 figures, 11 tables)

Figures (6)

  • Figure 1: Comparison of ASD methods in terms of mean average precision (mAP) on the AVA-ActiveSpeaker dataset, average-FLOPs, and number of parameters. Note that average-FLOPs is the computation required to predict the speaking activity of one face crop.
  • Figure 2: Long-term Intra-speaker Modeling (LIM), Short-term Inter-speaker Modeling (SIM), and comparison of LoCoNet with existing long-term parallel-inference ASD methods. Red boxes show inactive speakers and green boxes show active speakers. LIM uses the features of a single speaker across all frames to capture long-term relationships. SIM models the relationships of speakers within a short $m$-frame segment to capture the conversation pattern. The speaker context modeling of the existing long-term parallel-inference ASD methods [tao2021someone, liao2023light, datta2022asd] focuses only on LIM, while LoCoNet models both LIM and SIM to learn both contexts.
  • Figure 3: An overview of LoCoNet. Given a sequence of face tracks and audio of a target speaker, we sample $S-1$ speakers from all other people appearing in the scene and stack their face crops as visual input. Our method consists of 3 components: an audio encoder, a visual encoder, and a Long-Short Context Modeling module (LSCM) with $N$ blocks, where each block includes an attention-based Long-term Intra-speaker Model (LIM) and a convolution-based Short-term Inter-speaker Model (SIM) for speaker interaction. LIM involves Audio-Visual Self-Attention for long-term intra-speaker dependencies and Audio-Visual Cross-Attention for audio-visual interaction. The final output is used to classify speaking activity of the target person across all frames.
  • Figure 4: An illustration of our proposed audio encoder VGGFrame. We apply a deconvolutional layer to upsample the output features of block-4. The output features of block-3 (before max pooling) and of the deconvolutional layer are concatenated and transformed into per-frame output features of shape $T \times C$.
  • Figure 5: Comparison of LoCoNet and TalkNet results on challenging scenarios from AVA-ActiveSpeaker. Red boxes denote inactive speakers. Green boxes denote active speakers. Orange circles mark false predictions. The video on the left shows a multi-person conversation among four speakers and a separate conversation between two. The video on the right shows an active speaker with a small face. Both scenes are challenging, and LoCoNet predicts accurately in most cases.
  • ...and 1 more figure
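The input assembly described in the Figure 3 caption (a target speaker plus $S-1$ sampled context speakers, stacked as visual input) might be sketched as follows. The function name `build_visual_input`, the (T, H, W) crop layout, and the fallback used when fewer than $S-1$ other speakers are available (sampling with replacement, or repeating the target track when nobody else is in the scene) are all assumptions for illustration; the caption does not specify how that case is handled.

```python
import numpy as np

def build_visual_input(target_track, other_tracks, num_speakers, rng):
    """Stack the target face track with S-1 sampled context tracks.

    target_track: array of shape (T, H, W), face crops of the target speaker.
    other_tracks: list of (T, H, W) arrays for other people in the scene.
    Returns an array of shape (S, T, H, W) with the target at index 0.
    """
    n_context = num_speakers - 1
    if len(other_tracks) >= n_context:
        # Enough other speakers: sample without replacement.
        idx = rng.permutation(len(other_tracks))[:n_context]
        context = [other_tracks[i] for i in idx]
    elif other_tracks:
        # Assumption: too few speakers, so sample with replacement.
        idx = rng.integers(0, len(other_tracks), size=n_context)
        context = [other_tracks[i] for i in idx]
    else:
        # Assumption: no other speakers, so repeat the target track.
        context = [target_track] * n_context
    return np.stack([target_track] + context, axis=0)

rng = np.random.default_rng(0)
target = np.zeros((5, 4, 4))                            # T=5 frames of 4x4 crops
others = [np.full((5, 4, 4), k) for k in (1.0, 2.0, 3.0, 4.0)]
stacked = build_visual_input(target, others, num_speakers=3, rng=rng)
print(stacked.shape)  # (3, 5, 4, 4)
```

Keeping the target at a fixed index makes it straightforward to read out the target speaker's features after the context-modeling blocks.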