Cross Pseudo-Labeling for Semi-Supervised Audio-Visual Source Localization

Yuxin Guo, Shijie Ma, Yuhao Zhao, Hu Su, Wei Zou

TL;DR

The paper tackles semi-supervised Audio-Visual Source Localization (AVSL) by introducing Cross Pseudo-Labeling (XPL), where two models learn from each other through a cross-refine mechanism to mitigate confirmation bias inherent in single-model pseudo-labeling. It combines soft pseudo-labels with sharpening and pseudo-label exponential moving average (PL-EMA), plus a Curriculum Data Selection strategy to feed only reliable samples, all integrated into a joint objective that includes cross-pseudo supervision and an audio-visual contrastive loss. Key contributions include the cross-refine mechanism, soft-label self-training with EMA, and the curriculum-based data selection, validated by state-of-the-art results on Flickr-SoundNet and VGG-SoundSource under limited supervision and in open-set scenarios. The approach yields improved CIoU and AUC, better generalization, and increased training stability, signaling a robust and scalable path for multimodal localization tasks.

Abstract

Audio-Visual Source Localization (AVSL) is the task of identifying specific sounding objects in the scene given audio cues. In our work, we focus on semi-supervised AVSL with pseudo-labeling. To address the issues with vanilla hard pseudo-labels including bias accumulation, noise sensitivity, and instability, we propose a novel method named Cross Pseudo-Labeling (XPL), wherein two models learn from each other with the cross-refine mechanism to avoid bias accumulation. We equip XPL with two effective components. Firstly, the soft pseudo-labels with sharpening and pseudo-label exponential moving average mechanisms enable models to achieve gradual self-improvement and ensure stable training. Secondly, the curriculum data selection module adaptively selects pseudo-labels with high quality during training to mitigate potential bias. Experimental results demonstrate that XPL significantly outperforms existing methods, achieving state-of-the-art performance while effectively mitigating confirmation bias and ensuring training stability.

Paper Structure (12 sections, 7 equations, 2 figures, 2 tables)

This paper contains 12 sections, 7 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: An overview of the proposed XPL (top). Two distinct AVSL models (bottom, blue) generate prediction maps for input audio-visual pairs. To ensure stability in the initial stage, the Curriculum Data Selection mechanism (bottom, yellow) sorts the samples by reliability and feeds them to the models in batches. Then the Sharpening and PL-EMA module (bottom, green) sharpens the prediction map and applies an exponential moving average (EMA) with the previous time step's pseudo-label to obtain the final pseudo-label. Finally, each model trains on the pseudo-labels generated by the other, thereby rectifying bias.
  • Figure 2: Visualization of the proposed XPL. XPL accurately localizes sounding objects of various sizes and effectively distinguishes foreground from background elements.
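
The pseudo-label pipeline described in the Figure 1 caption (sharpen the prediction map, smooth it with an EMA over the previous pseudo-label, then swap labels between the two models) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sharpening formula (a binary temperature sharpening), the function names, and the `temperature` and `momentum` values are all hypothetical, since this summary does not give the paper's equations.

```python
import numpy as np


def sharpen(pred_map, temperature=0.5):
    """Push per-pixel probabilities toward 0 or 1.

    Assumed form: binary temperature sharpening. The paper's exact
    sharpening function is not given in this summary.
    """
    p = np.clip(pred_map, 1e-8, 1.0 - 1e-8)
    pos = p ** (1.0 / temperature)
    neg = (1.0 - p) ** (1.0 / temperature)
    return pos / (pos + neg)


def pl_ema(prev_label, sharpened, momentum=0.9):
    """Blend the new sharpened map with the previous pseudo-label.

    `momentum` is an illustrative value, not one taken from the paper.
    """
    if prev_label is None:  # first step: no history to average with
        return sharpened
    return momentum * prev_label + (1.0 - momentum) * sharpened


def cross_pseudo_labels(pred_a, pred_b, prev_a=None, prev_b=None,
                        temperature=0.5, momentum=0.9):
    """Build soft pseudo-labels from each model and swap them.

    Model A is supervised by the label derived from model B's
    prediction, and vice versa.
    """
    label_a = pl_ema(prev_a, sharpen(pred_a, temperature), momentum)
    label_b = pl_ema(prev_b, sharpen(pred_b, temperature), momentum)
    return {
        "target_for_a": label_b,  # A learns from B's pseudo-label
        "target_for_b": label_a,  # B learns from A's pseudo-label
        "label_a": label_a,       # kept as EMA history for the next step
        "label_b": label_b,
    }
```

In a training loop, the returned `label_a` and `label_b` would be passed back in as `prev_a` and `prev_b` on the next step, so each pseudo-label changes gradually rather than jumping with every new prediction.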