Cross Pseudo-Labeling for Semi-Supervised Audio-Visual Source Localization
Yuxin Guo, Shijie Ma, Yuhao Zhao, Hu Su, Wei Zou
TL;DR
The paper tackles semi-supervised Audio-Visual Source Localization (AVSL) by introducing Cross Pseudo-Labeling (XPL), in which two models learn from each other through a cross-refine mechanism, mitigating the confirmation bias inherent in single-model pseudo-labeling. XPL combines soft pseudo-labels, refined by sharpening and a pseudo-label exponential moving average (PL-EMA), with a Curriculum Data Selection strategy that passes only reliable samples to training. These components are integrated into a joint objective that includes cross pseudo-supervision and an audio-visual contrastive loss. The key contributions are the cross-refine mechanism, soft-label self-training with EMA, and curriculum-based data selection, validated by state-of-the-art results on Flickr-SoundNet and VGG-SoundSource under limited supervision and in open-set scenarios. The approach improves CIoU and AUC, generalizes better, and trains more stably.
Abstract
Audio-Visual Source Localization (AVSL) is the task of identifying specific sounding objects in the scene given audio cues. In our work, we focus on semi-supervised AVSL with pseudo-labeling. To address the issues with vanilla hard pseudo-labels, including bias accumulation, noise sensitivity, and instability, we propose a novel method named Cross Pseudo-Labeling (XPL), wherein two models learn from each other through a cross-refine mechanism to avoid bias accumulation. We equip XPL with two effective components. First, soft pseudo-labels with sharpening and pseudo-label exponential moving average mechanisms enable the models to achieve gradual self-improvement and ensure stable training. Second, the curriculum data selection module adaptively selects high-quality pseudo-labels during training to mitigate potential bias. Experimental results demonstrate that XPL significantly outperforms existing methods, achieving state-of-the-art performance while effectively mitigating confirmation bias and ensuring training stability.
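To make the soft pseudo-label pipeline concrete, the following is a minimal sketch of how sharpening, a pseudo-label EMA, and the cross exchange between two models could fit together. The abstract does not give the exact formulas, so everything here is an assumption: per-pixel binary temperature sharpening, a standard exponential moving average, and a plain swap of the smoothed labels between the two models. The names `sharpen`, `ema_update`, `cross_targets`, `temperature`, and `momentum` are hypothetical and chosen only for illustration.

```python
import numpy as np


def sharpen(prob_map, temperature=0.5):
    """Push per-pixel probabilities toward 0 or 1.

    Assumed form: binary temperature sharpening; not taken from the paper.
    """
    p = np.clip(np.asarray(prob_map, dtype=float), 1e-7, 1.0 - 1e-7)
    fg = p ** (1.0 / temperature)          # foreground mass after sharpening
    bg = (1.0 - p) ** (1.0 / temperature)  # background mass after sharpening
    return fg / (fg + bg)


def ema_update(previous, current, momentum=0.9):
    """Blend the stored pseudo-label with the newest one (assumed standard EMA)."""
    previous = np.asarray(previous, dtype=float)
    current = np.asarray(current, dtype=float)
    return momentum * previous + (1.0 - momentum) * current


def cross_targets(pred_a, pred_b, ema_a, ema_b, temperature=0.5, momentum=0.9):
    """Update both EMA pseudo-labels, then swap them.

    Each model is supervised by its peer's smoothed pseudo-label
    (an assumed reading of the cross-refine idea).
    """
    ema_a = ema_update(ema_a, sharpen(pred_a, temperature), momentum)
    ema_b = ema_update(ema_b, sharpen(pred_b, temperature), momentum)
    target_for_a, target_for_b = ema_b, ema_a
    return target_for_a, target_for_b, ema_a, ema_b
```

In this sketch a lower `temperature` yields pseudo-labels closer to hard masks, while a higher `momentum` makes them change more slowly between training steps, which is one plausible way to trade label confidence against stability.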
