BIAS: A Body-based Interpretable Active Speaker Approach
Tiago Roxo, Joana C. Costa, Pedro R. M. Inácio, Hugo Proença
TL;DR
BIAS addresses the limitations of audio- and face-centric active speaker detection in wild settings by incorporating body information, achieving state-of-the-art results on challenging WASD categories and strong cross-domain performance on Columbia. It introduces a novel use of Squeeze-and-Excitation blocks to generate attention heatmaps and quantify feature importance, enabling interpretability without added inference cost. The authors also develop ASD-Text, a dataset and fine-tuning setup for text-based scene descriptions in ASD contexts, culminating in a full interpretability pipeline that couples visual explanations with textual captions. This work advances ASD robustness in surveillance-like environments and provides practical, interpretable baselines for future multi-modal ASD research.
Abstract
State-of-the-art Active Speaker Detection (ASD) approaches rely heavily on audio and facial features, a dependence that does not hold up in wild scenarios. Although these methods achieve good results on the standard AVA-ActiveSpeaker set, a recent, wilder ASD dataset (WASD) exposed the limitations of such models and raised the need for new approaches. We therefore propose BIAS, a model that, for the first time, combines audio, face, and body information to accurately predict active speakers in varying and challenging conditions. Additionally, we design BIAS to provide interpretability by proposing a novel use for Squeeze-and-Excitation blocks, namely attention heatmap creation and feature importance assessment. For a full interpretability setup, we annotate a dataset of ASD-related actions (ASD-Text) and use it to fine-tune a ViT-GPT2 model for textual scene description, complementing BIAS interpretability. The results show that BIAS is state-of-the-art in challenging conditions where body-based features are of utmost importance (Columbia, open-settings, and WASD), and yields competitive results on AVA-ActiveSpeaker, where face is more influential than body for ASD. BIAS interpretability also shows which features and aspects are most relevant to ASD prediction in varying settings, making it a strong baseline for further developments in interpretable ASD models. BIAS is available at https://github.com/Tiago-Roxo/BIAS.
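To make the Squeeze-and-Excitation idea concrete, the following is a minimal, dependency-free sketch of how channel weights from a generic SE block could double as feature-importance scores and be combined into a spatial heatmap. It is an illustration under stated assumptions, not the BIAS implementation: the function names (`se_channel_weights`, `se_heatmap`), the toy weights, and the weighted-channel-sum rule for the heatmap are all hypothetical choices made here.

```python
import math

def se_channel_weights(feature_map, w1, w2):
    """Generic SE block (hypothetical helper, not the BIAS code).

    feature_map: list of C channels, each an H x W nested list.
    w1: reduction weights, shape (C // r) x C.
    w2: expansion weights, shape C x (C // r).
    Returns one weight in (0, 1) per channel.
    """
    # Squeeze: global average pooling over each channel.
    squeezed = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in feature_map
    ]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    return [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]

def se_heatmap(feature_map, weights):
    """Assumed heatmap rule: weighted sum of channels, min-max scaled to [0, 1]."""
    height, width = len(feature_map[0]), len(feature_map[0][0])
    heat = [
        [sum(wt * ch[i][j] for wt, ch in zip(weights, feature_map))
         for j in range(width)]
        for i in range(height)
    ]
    lo = min(min(row) for row in heat)
    hi = max(max(row) for row in heat)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat map
    return [[(v - lo) / span for v in row] for row in heat]

# Toy input: two 2x2 channels and hand-picked (not learned) weights.
feature_map = [
    [[4.0, 0.0], [0.0, 0.0]],  # channel 0 responds in the top-left corner
    [[0.0, 0.0], [0.0, 2.0]],  # channel 1 responds in the bottom-right corner
]
w1 = [[1.0, -1.0]]     # reduce 2 channels to 1 hidden unit
w2 = [[2.0], [-2.0]]   # expand back to 2 channel weights

weights = se_channel_weights(feature_map, w1, w2)
heat = se_heatmap(feature_map, weights)
```

In this toy case the per-channel weights can be read directly as importance scores (channel 0 outranks channel 1), and the heatmap peaks where the more important channel is active. Because both quantities reuse values an SE block already computes in the forward pass, a scheme of this kind adds no extra inference cost.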
