Audio-Visual Talker Localization in Video for Spatial Sound Reproduction

Davide Berghi, Philip J. B. Jackson

TL;DR

This work tackles automatic extraction of speaker positional metadata for object-based media by pursuing audio-visual active speaker detection and localization (ASDL) in video. It introduces an audio-visual architecture (AV-Conformer) that fuses multichannel audio features (log-mel plus GCC-PHAT) with visual embeddings (ResNet50 + Conformer) and conditions predictions on a target camera view. Using the Tragic Talkers dataset, the proposed AV-M system outperforms single-channel audio-visual baselines and prior multichannel audio methods, improving both detection reliability and spatial accuracy. The results demonstrate the practical value of jointly leveraging multichannel audio with visual cues for automated, camera-aligned localization of active talkers in immersive object-based audio workflows.
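GCC-PHAT, one of the multichannel audio features named above, can be illustrated with a short sketch. This is not the paper's feature extraction: the frame length, the naive DFT (chosen only to avoid external dependencies), and the helper names `dft`, `gcc_phat`, and `estimate_delay` are all assumptions made for illustration. The sketch whitens the cross-spectrum of two microphone frames so that only phase remains, then reads the inter-channel delay from the peak of the resulting correlation.

```python
import cmath
import math
import random


def dft(x, inverse=False):
    """Naive O(n^2) DFT; adequate for short illustrative frames only."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(sig, ref, eps=1e-12):
    """Circular GCC-PHAT sequence between two equal-length frames."""
    spec_sig = dft(sig)
    spec_ref = dft(ref)
    # Cross-spectrum, then PHAT weighting: keep the phase, discard the magnitude.
    cross = [a * b.conjugate() for a, b in zip(spec_sig, spec_ref)]
    whitened = [c / (abs(c) + eps) for c in cross]
    return [v.real for v in dft(whitened, inverse=True)]


def estimate_delay(sig, ref):
    """Lag in samples at the GCC-PHAT peak; positive means sig lags ref."""
    cc = gcc_phat(sig, ref)
    n = len(cc)
    peak = max(range(n), key=lambda i: cc[i])
    # Indices above n/2 correspond to negative lags in a circular correlation.
    return peak if peak <= n // 2 else peak - n


# Hypothetical two-microphone frame: white noise and a copy delayed by 5 samples.
rng = random.Random(0)
ref = [rng.gauss(0.0, 1.0) for _ in range(64)]
delayed = [ref[(t - 5) % 64] for t in range(64)]
print(estimate_delay(delayed, ref))
```

In a real pipeline an FFT library would replace the naive transform, and the GCC-PHAT sequences for each microphone pair would typically be stacked with the log-mel spectrograms as input channels; the exact arrangement used by the authors is not stated in this summary.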

Abstract

Object-based audio production requires positional metadata to be defined for each point-source object, including the key elements in the foreground of the sound scene. In many media production use cases, both cameras and microphones are employed to make recordings, and the human voice is often a key element. In this research, we detect and locate the active speaker in the video, facilitating the automatic extraction of the positional metadata of the talker relative to the camera's reference frame. With the integration of the visual modality, this study expands upon our previous investigation focused solely on audio-based active speaker detection and localization. Our experiments compare conventional audio-visual approaches for active speaker detection that leverage monaural audio, our previous audio-only method that leverages multichannel recordings from a microphone array, and a novel audio-visual approach integrating vision and multichannel audio. We found that the two modalities complement each other. Multichannel audio, overcoming the problem of visual occlusions, provides a double-digit reduction in detection error compared to audio-visual methods with single-channel audio. The combination of multichannel audio and vision further enhances spatial accuracy, leading to a four-percentage-point increase in F1 score on the Tragic Talkers dataset. Future investigations will assess the robustness of the model in noisy and highly reverberant environments, as well as tackle the problem of off-screen speakers.

Paper Structure (15 sections, 2 equations, 4 figures, 1 table)

This paper contains 15 sections, 2 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Pipeline for speech-signal objectification proposed by Mohd Izhar et al. izhar:2020:AVtracker and Schweiger et al. Schweiger:2022:tool6dof. Positional metadata are automatically predicted by leveraging video and microphone array data. These predictions are not only used as final positional information for the spatialization of the objects but also to drive a spatial beamformer. Filtered signals extracted with the beamformer are associated with, and replaced by, the high-quality speech data recorded with lavalier microphones. This paper focuses on the audio-visual prediction of the speaker's positional metadata.
  • Figure 2: (a) Schematic of camera (blue circles) and microphone (red dots) positions on the AVA Rig. The green square highlights the reference microphone. (b) Photo of an AVA Rig.
  • Figure 3: Proposed network architecture for audio-visual ASDL. An audio encoder based on a CNN extracts an audio embedding from the audio input features. Similarly, an encoder consisting of ResNet50 He:2016:resnet followed by a Conformer unit Gulati2020ConformerCT extracts a visual embedding from the video frames. $\otimes$ denotes the concatenation operation. After concatenation, the audio-visual features are processed by a second Conformer unit. A feed-forward network generates the final prediction. A camera ID one-hot vector is used to regress the speaker's position to the desired camera view.
  • Figure 4: Comparison of precision versus recall curves for different ASDL methods. The plot includes audio-visual methods with single-channel audio (AV-S), i.e., ASC Alcazar_2020_CVPR and TalkNet tao2021TalkNet, the multichannel audio-only CRNN (M-S) Berghi:2024:TASLP, and the proposed audio-visual approach with multichannel audio (AV-M). The combination of precision and recall rates that achieves the highest F1 score is marked on each curve.
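
The operating point marked on each curve in Figure 4 is the precision and recall pair that maximizes F1. A minimal sketch of that selection follows, assuming the curve is available as a list of (threshold, precision, recall) tuples; the function names and the numbers in `example_curve` are hypothetical and are not results from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_operating_point(curve):
    """Return the (threshold, precision, recall) tuple with the highest F1, and that F1."""
    best = max(curve, key=lambda point: f1_score(point[1], point[2]))
    return best, f1_score(best[1], best[2])


# Made-up values for illustration only.
example_curve = [
    (0.2, 0.60, 0.95),
    (0.5, 0.80, 0.80),
    (0.8, 0.95, 0.50),
]

point, f1 = best_operating_point(example_curve)
print(point, round(f1, 3))
```

Sweeping the detection threshold trades precision against recall, so reporting the maximum-F1 point gives one comparable figure per method.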