BIAS: A Body-based Interpretable Active Speaker Approach

Tiago Roxo, Joana C. Costa, Pedro R. M. Inácio, Hugo Proença

TL;DR

BIAS addresses the limitations of audio- and face-centric active speaker detection in wild settings by incorporating body information, achieving state-of-the-art results on challenging WASD categories and strong cross-domain performance on Columbia. It introduces a novel use of Squeeze-and-Excitation blocks to generate attention heatmaps and quantify feature importance, enabling interpretability without added inference cost. The authors also develop ASD-Text, a dataset and fine-tuning setup for text-based scene descriptions in ASD contexts, culminating in a full interpretability pipeline that couples visual explanations with textual captions. This work advances ASD robustness in surveillance-like environments and provides practical, interpretable baselines for future multi-modal ASD research.

Abstract

State-of-the-art Active Speaker Detection (ASD) approaches rely heavily on audio and facial features, which is not a sustainable approach in wild scenarios. Although these methods achieve good results on the standard AVA-ActiveSpeaker set, a recent wilder ASD dataset (WASD) showed the limitations of such models and raised the need for new approaches. As such, we propose BIAS, a model that, for the first time, combines audio, face, and body information to accurately predict active speakers in varying/challenging conditions. Additionally, we design BIAS to provide interpretability by proposing a novel use for Squeeze-and-Excitation blocks, namely in attention heatmap creation and feature importance assessment. For a full interpretability setup, we annotate an ASD-related actions dataset (ASD-Text) to fine-tune a ViT-GPT2 for text scene description, complementing BIAS interpretability. The results show that BIAS is state-of-the-art in challenging conditions where body-based features are of utmost importance (Columbia, open-settings, and WASD), and yields competitive results in AVA-ActiveSpeaker, where face is more influential than body for ASD. BIAS interpretability also shows which features/aspects are most relevant to ASD prediction in varying settings, making it a strong baseline for further developments in interpretable ASD models. BIAS is available at https://github.com/Tiago-Roxo/BIAS.
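To make the feature importance idea more concrete, the following is a minimal sketch of a generic Squeeze-and-Excitation gate applied over three modality embeddings (audio, body, face). It is not the authors' implementation: the embedding size, the random untrained weights, the helper name `squeeze_excite`, and the normalisation of gate values into relative shares are all assumptions made for illustration.

```python
import math
import random


def squeeze_excite(channels, w1, w2):
    """Generic Squeeze-and-Excitation gate over a list of channels.

    channels: list of C lists (each a flattened feature map or embedding).
    w1: R x C weight matrix (reduction layer); w2: C x R weight matrix.
    Returns (gates, rescaled), with one gate in (0, 1) per channel.
    """
    # Squeeze: global average of each channel.
    z = [sum(ch) / len(ch) for ch in channels]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: reweight every channel by its gate.
    rescaled = [[g * v for v in ch] for g, ch in zip(gates, channels)]
    return gates, rescaled


random.seed(0)
modalities = ["audio", "body", "face"]
# Hypothetical 8-dimensional embeddings, one per modality (assumption).
embeddings = [[random.uniform(0.0, 1.0) for _ in range(8)] for _ in modalities]
# Random, untrained weights for a 3 -> 2 -> 3 bottleneck (assumption).
w1 = [[random.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(2)]
w2 = [[random.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(3)]

gates, rescaled = squeeze_excite(embeddings, w1, w2)

# Relative importance as each gate's share of the total (assumption).
total = sum(gates)
importance = {name: g / total for name, g in zip(modalities, gates)}
for name, g in zip(modalities, gates):
    print(f"{name}: gate={g:.3f}, share={importance[name]:.3f}")
```

Because the gate values are already computed during the forward pass, reading them out adds no extra inference step, which is consistent with the TL;DR's claim of interpretability without added inference cost.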

Paper Structure

This paper contains 18 sections, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Illustration of the BIAS insight: in surveillance settings, where facial and audio-based features might not always be available, body data should be crucial to accurately detect the active speakers. In such challenging conditions, providing reliable explanations for the reasoning behind the provided responses is also an important feature. This paper describes BIAS, which uniquely combines facial, audio, and body-based features, and also provides visual interpretability and feature importance assessment for its responses.
  • Figure 2: Overview of the BIAS architecture and pipeline, with GPT model integration: body and face-based data are fed into the respective visual encoders, while audio is processed into MFCC before encoding. SE blocks are used in the visual encoders and in the feature combination, for attention heatmaps and relative feature importance, respectively. SA refers to self-attention blocks. Heatmaps are created by combining the channel features corresponding to the top 10% SE vector values. BIAS prediction is based on feature combination, accompanied by visual interpretability and feature importance assessment, and complemented by text descriptions from a GPT model fine-tuned on ASD-related actions data (ASD-Text).
  • Figure 3: Comparison of face and body area, as a percentage of image dimension. AVA-ActiveSpeaker contains data with subjects closer to the camera, expressed by higher face and body percentages, relative to WASD and any of its categories. Surveillance Settings (SS) is the category in which subjects are farthest from the camera.
  • Figure 4: SE vector values from feature (audio, body, and face) combination, for BIAS trained on AVA-ActiveSpeaker and WASD. For both datasets, the values follow a normal distribution. AVA refers to AVA-ActiveSpeaker.
  • Figure 5: Performance of BIAS, TalkNet, and BIAS$_{F}$ relative to Head-Body Proportion (HBP) in WASD, across 5 equidistant intervals based on minimum (0.1) and maximum (0.7) HBP. BIAS$_{F}$ refers to BIAS with only face as visual input.
  • ...and 7 more figures
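
The Figure 2 caption states that heatmaps are created by combining the channel features of the top 10% SE vector values. A minimal sketch of that selection step is given below. The function name `se_heatmap`, the plain averaging used to combine the selected channels, and the min-max normalisation are assumptions; the caption does not specify how the channels are combined.

```python
def se_heatmap(feature_maps, se_values, top_fraction=0.10):
    """Build a heatmap from the channels with the highest SE values.

    feature_maps: list of C maps, each an H x W list of lists.
    se_values: list of C SE gate values, one per channel.
    Returns (top_channels, heatmap), with the heatmap scaled to [0, 1].
    """
    num_channels = len(se_values)
    # Number of channels to keep: top 10% by default, at least one.
    k = max(1, int(round(top_fraction * num_channels)))
    top = sorted(range(num_channels), key=lambda c: se_values[c], reverse=True)[:k]

    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    # Combine the selected channels with a plain average (assumption).
    heat = [
        [sum(feature_maps[c][i][j] for c in top) / k for j in range(width)]
        for i in range(height)
    ]

    # Min-max normalise so the map can be overlaid on the input image.
    low = min(min(row) for row in heat)
    high = max(max(row) for row in heat)
    span = high - low
    if span == 0:
        return top, [[0.0] * width for _ in range(height)]
    return top, [[(v - low) / span for v in row] for row in heat]


# Toy input: 20 channels of 2 x 2 maps, so the top 10% is 2 channels.
feature_maps = [[[float(c), 0.0], [0.0, 1.0]] for c in range(20)]
se_values = [c / 100 for c in range(20)]
se_values[3] = 0.9   # make channel 3 the most important
se_values[17] = 0.8  # and channel 17 the second most important

top_channels, heatmap = se_heatmap(feature_maps, se_values)
print(top_channels)
print(heatmap)
```

In practice the resulting low-resolution map would be upsampled to the size of the face or body crop before being overlaid; that step is omitted here.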