Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization

Yuxin Guo, Shijie Ma, Hu Su, Zhiqing Wang, Yuhao Zhao, Wei Zou, Siyang Sun, Yun Zheng

TL;DR

This paper tackles Audio-Visual Source Localization (AVSL) under limited bounding-box supervision by introducing Dual Mean-Teacher (DMT), a semi-supervised framework built from two teacher–student pairs. DMT uses consensus-based Noise Filtering and Intersection of Pseudo-Labels (IPL) to generate high-quality pseudo-labels, and updates each teacher as an exponential moving average (EMA) of its student to curb confirmation bias. Training proceeds in a warm-up stage followed by an unbiased-learning stage that combines supervised and contrastive objectives. DMT achieves state-of-the-art CIoU on Flickr-SoundNet and VGG-Sound Source with only a small fraction of labeled data (e.g., 3%), and it can also be applied to existing AVSL methods. The results indicate improved localization accuracy, especially for small objects, and suggest that the framework generalizes across domains and data sources.
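In a Mean-Teacher setup, the teacher is not trained by gradient descent; its weights track a running average of the student's weights. A minimal sketch of that update is shown here, assuming weights stored as a dictionary of NumPy arrays. The function name `ema_update` and the `decay` value are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def ema_update(teacher, student, decay=0.999):
    """Move each teacher weight toward the matching student weight.

    teacher, student: dicts mapping parameter names to NumPy arrays
    (a hypothetical storage format chosen for this sketch).
    decay: EMA momentum; 0.999 is an assumed value, not the paper's.
    """
    for name, student_w in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * student_w
    return teacher

# Usage: one update step with a deliberately small decay so the change is visible.
teacher = {"w": np.array([1.0, 1.0])}
student = {"w": np.array([0.0, 2.0])}
teacher = ema_update(teacher, student, decay=0.9)
```

Because the teacher changes slowly, its predictions are more stable than the student's, which is what makes it usable as a pseudo-label source.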

Abstract

Audio-Visual Source Localization (AVSL) aims to locate sounding objects within video frames given the paired audio clips. Existing methods predominantly rely on self-supervised contrastive learning of audio-visual correspondence. Without any bounding-box annotations, they struggle to achieve precise localization, especially for small objects, and suffer from blurry boundaries and false positives. Moreover, naive semi-supervised methods make poor use of the abundant unlabeled data. In this paper, we propose a novel semi-supervised learning framework for AVSL, namely Dual Mean-Teacher (DMT), comprising two teacher-student structures to circumvent the confirmation bias issue. Specifically, two teachers, pre-trained on limited labeled data, are employed to filter out noisy samples via the consensus between their predictions, and then generate high-quality pseudo-labels by intersecting their confidence maps. The full use of both labeled and unlabeled data, together with the proposed unbiased framework, enables DMT to outperform current state-of-the-art methods by a large margin, with CIoU of 90.4% and 48.8% on Flickr-SoundNet and VGG-Sound Source. These correspond to improvements of 8.9% and 9.6% over self-supervised methods and 4.6% and 6.4% over semi-supervised methods on the two datasets, respectively, given only 3% positional annotations. We also extend our framework to some existing AVSL methods and consistently boost their performance.
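The two-teacher filtering and intersection steps described in the abstract can be illustrated with a small sketch. It assumes each teacher outputs a confidence map with values in [0, 1], measures consensus as the IoU of the binarized maps, and takes the pseudo-label as the pixel-wise intersection. The function names and both thresholds are hypothetical choices for illustration; the paper's exact criteria may differ.

```python
import numpy as np

def binarize(conf_map, threshold=0.5):
    """Turn a confidence map into a boolean foreground mask."""
    return conf_map >= threshold

def iou(mask_a, mask_b):
    """Intersection-over-union of two boolean masks (0.0 if both are empty)."""
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 0.0
    return np.logical_and(mask_a, mask_b).sum() / union

def filter_and_intersect(conf_a, conf_b, bin_threshold=0.5, agree_threshold=0.5):
    """Return an intersected pseudo-label, or None if the teachers disagree.

    Both thresholds are assumed values for this sketch.
    """
    mask_a = binarize(conf_a, bin_threshold)
    mask_b = binarize(conf_b, bin_threshold)
    # Noise filtering: reject the sample when the two teachers lack consensus.
    if iou(mask_a, mask_b) < agree_threshold:
        return None
    # Intersection of pseudo-labels: keep only pixels both teachers mark.
    return np.logical_and(mask_a, mask_b)

# Usage: two teachers that largely agree on a 4x4 frame.
conf_a = np.zeros((4, 4)); conf_a[1:3, 1:4] = 0.9
conf_b = np.zeros((4, 4)); conf_b[1:3, 0:3] = 0.8
pseudo_label = filter_and_intersect(conf_a, conf_b)
```

Taking the intersection rather than either teacher's map alone keeps only the region both teachers support, which favors precise pseudo-labels over larger, noisier ones.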

Paper Structure

This paper contains 72 sections, 20 equations, 10 figures, 15 tables, 1 algorithm.

Figures (10)

  • Figure 1: Comparison of existing Audio-Visual Source Localization (AVSL) methods and the proposed Dual Mean-Teacher (DMT). Left: DMT largely addresses severe issues including inaccurate small-object localization, blurry boundaries, and instability. Right: DMT outperforms previous methods by a large margin on the Flickr and VGG-ss datasets.
  • Figure 2: Overview of the proposed Dual Mean-Teacher framework. Left: Overall learning process of the dual teacher-student pairs; the two students are guided by both ground-truth labeled data and filtered unlabeled data with the Intersection of Pseudo-Labels (IPL). Upper-right: Details of Noise Filtering and IPL. The dual teachers reject noisy samples based on their consensus and generate pseudo-labels on the filtered data. Lower-right: Details of the AVSL pipeline. Students are trained through contrastive learning and predict confidence maps for supervised learning with (pseudo-)labels.
  • Figure 3: Performance on the music domain.
  • Figure 4: Effect of Warm-Up Stage.
  • Figure 5: The effect of each component of DMT (Noise Filtering, IPL and EMA) in suppressing confirmation bias, together with the number of filtered samples used for pseudo-labeling, shown in (d).
  • ...and 5 more figures