SemiPL: A Semi-supervised Method for Event Sound Source Localization

Yue Li, Baiqiao Yin, Jinfu Liu, Jiajun Wen, Jiaying Lin, Mengyuan Liu

TL;DR

This work tackles event sound source localization in chaotic scenes under limited labels. It extends the self-supervised SSPL framework with SemiPL, a semi-supervised loss that combines supervised heatmap learning with an unsupervised negative-cosine consistency term $L_U$ to exploit unlabeled data. On the Chaotic World dataset, the reported gains over the previously published results are ≈12.2% in cIoU and ≈0.56% in AUC, while qualitative and ablation analyses reveal parameter sensitivity and residual challenges in multi-source scenarios. The findings underscore the value of semi-supervised strategies for robust audio–visual localization in complex real-world events and point to future work on fully supervised multi-source localization strategies.
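As a rough illustration of how a supervised term and a negative-cosine term such as $L_U$ might be combined, the following is a minimal sketch, not the authors' implementation. The function names, the plain-list feature vectors, and the `weight` factor are all assumptions made for this example. The summary does not say which features $L_U$ compares or how the two terms are balanced.

```python
import math


def negative_cosine(p, z, eps=1e-8):
    """Return -cos(p, z) for two equal-length feature vectors.

    The value is -1 when the vectors point the same way and +1 when they
    point in opposite directions, so minimizing it pulls them together.
    """
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    # eps guards against division by zero for all-zero vectors.
    return -dot / (max(norm_p, eps) * max(norm_z, eps))


def semi_supervised_loss(supervised_loss, p, z, weight=1.0):
    """Add a weighted negative-cosine term to a supervised loss value.

    `weight` is a hypothetical balancing factor; the source does not
    state how the supervised and unsupervised terms are weighted.
    """
    return supervised_loss + weight * negative_cosine(p, z)
```

In this sketch, labeled samples would contribute through `supervised_loss` (for example, a heatmap loss computed elsewhere), and unlabeled samples would contribute only through the consistency term.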

Abstract

In recent years, event sound source localization has been widely applied in various fields. Recent works, typically built on a contrastive learning framework, show impressive performance. However, all of this work is based on large but relatively simple datasets. In many applications, e.g., crowd management and emergency response services, it is also crucial to understand and analyze human behaviors (actions and interactions of people), voices, and sounds in chaotic events. In this paper, we apply the existing model to a more complex dataset, explore the influence of parameters on the model, and propose SemiPL, a semi-supervised improvement method. Given growing data quantities and the influence of label quality, self-supervised learning will be an unstoppable trend. Experiments show that parameter adjustment positively affects the existing model. In particular, SSPL achieves an improvement of 12.2% in cIoU and 0.56% in AUC on Chaotic World compared to the previously reported results. The code is available at: https://github.com/ly245422/SSPL

Paper Structure

This paper contains 15 sections, 7 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Left: SSPL (w/ PCM) input data format. Right: SSPL (w/o PCM) input data format.
  • Figure 2: Framework of our Semi-Supervised Model SemiPL. The AM and PCM modules remain consistent with SSPL, with the addition of an unsupervised loss.
  • Figure 3: The first row shows the self-supervised model, and the second row shows the semi-supervised model SemiPL. The self-supervised model has a somewhat larger recognition area for vocalizing objects.
  • Figure 4: The first row is batch size 64, learning rate 5e-5; the second row is batch size 128, learning rate 3e-5; the third row is batch size 128, learning rate 5e-5.
  • Figure 5: Results for different learning rates. The top learning rate is 3e-5; the bottom learning rate is 5e-5.