Leveraging Reverberation and Visual Depth Cues for Sound Event Localization and Detection with Distance Estimation

Davide Berghi, Philip J. B. Jackson

Abstract

This report describes our systems submitted to the DCASE2024 Task 3 challenge: Audio and Audiovisual Sound Event Localization and Detection (SELD) with Source Distance Estimation (Track B). Our main model is based on the audio-visual (AV) Conformer, which processes video embeddings extracted with ResNet50 and audio embeddings extracted with an audio encoder pre-trained on SELD. On the development set of the STARSS23 dataset, this model outperformed the audio-visual baseline by a wide margin, halving its direction-of-arrival error (DOAE) and improving the F1 score by more than 3x. Our second system performs a temporal ensemble of the outputs of the AV-Conformer. For our third system, we extended the model with features for distance estimation, such as direct and reverberant signal components extracted from the omnidirectional audio channel, and depth maps extracted from the video frames. While this system improved the relative distance error (RDE) of our previous model by about 3 percentage points, it achieved a lower F1 score. This may be caused by sound classes that rarely appear in the training set and that the more complex system fails to detect, a hypothesis that further analysis could confirm. To overcome this problem, our fourth and final system is an ensemble that combines the predictions of the other three. Many opportunities to refine the system and training strategy remain to be tested in future ablation experiments, and are likely to yield incremental performance gains for this audio-visual task.

Paper Structure

This paper contains 15 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Diagram of the two main architectures submitted to the challenge. On the left, the AV-Conformer with its audio and visual encoders. On the right, the Depth-cued model, which includes a frame encoder that extracts visual features from the central frame. The Depth-cued model uses cubemap views, whereas the AV-Conformer uses equirectangular views. The snowflake symbol indicates that the weights of ResNet50 and ViT are fixed during training.
  • Figure 2: Examples of visual transformations and the corresponding DOA augmentations for "fold3_room6_mix006.mp4".
  • Figure 3: Examples of the cubemap transformation and depth map features for "fold3_room13_mix003.mp4". Note how, after removing the top and bottom faces from the cubemap representation, the EigenMike and a good portion of the ceiling are no longer in the frame.