Towards Weakly Supervised Text-to-Audio Grounding

Xuenan Xu, Ziyang Ma, Mengyue Wu, Kai Yu

TL;DR

This work advances weakly supervised text-to-audio grounding (WSTAG) by moving from sentence-level to phrase-level supervision, which reduces the mismatch between the text seen in training and the text seen at test time. It analyzes pooling strategies, introduces advanced negative sampling (similarity-based and clustering-based) and self-supervision to refine weak labels, and demonstrates substantial gains over prior WSTAG methods while generalizing to sound event detection (SED) datasets. The approach achieves performance comparable to the state of the art on AudioCaps/AudioGrounding and performs robustly on the DESED dataset, especially for short-duration events. These results underscore the practicality of phrase-level WSTAG for scalable cross-modal grounding with minimal annotation effort.

Abstract

The text-to-audio grounding (TAG) task aims to predict the onsets and offsets of sound events described by natural language. This task can facilitate applications such as multimodal information retrieval. This paper focuses on weakly-supervised text-to-audio grounding (WSTAG), where frame-level annotations of sound events are unavailable and only the caption of a whole audio clip can be used for training. WSTAG is superior to strongly-supervised approaches in its scalability to large audio-text datasets. Two WSTAG frameworks are studied in this paper: sentence-level and phrase-level. First, we analyze the limitations of mean pooling used in the previous WSTAG approach and investigate the effects of different pooling strategies. We then propose phrase-level WSTAG, which uses matching labels between audio clips and phrases for training. Advanced negative sampling strategies and self-supervision are proposed to enhance the accuracy of the weak labels and provide pseudo strong labels. Experimental results show that our system significantly outperforms the previous WSTAG SOTA. Finally, we conduct extensive experiments to analyze the effects of several factors on phrase-level WSTAG. The code and model are available at https://github.com/wsntxxn/TextToAudioGrounding.

Paper Structure

This paper contains 33 sections, 10 equations, 14 figures, 5 tables, and 2 algorithms.

Figures (14)

  • Figure 1: Comparison between SSTAG and WSTAG.
  • Figure 2: Sentence-level WSTAG. For an audio-caption pair, the frame-phrase similarities $\mathrm{s}_\mathrm{fp}$ are calculated. During training, audio pooling and text pooling transform them into the clip-sentence similarity $\mathrm{s}_\mathrm{cs}$ for loss calculation. During inference, $\mathrm{s}_\mathrm{fp}$ are taken as outputs.
  • Figure 3: The proposed phrase-level WSTAG approach. For an audio-caption pair, the training data contain both extracted positive phrases and sampled negative ones. A pre-trained WSTAG model is utilized to provide self-supervision. It adopts the same architecture as the WSTAG model to be trained and is trained using only $\mathcal{L}_{\text{weak}}$.
  • Figure 4: The event distribution of phrases. Each phrase is mapped to its most acoustically similar AudioSet event. The phrase count of each class is plotted.
  • Figure 5: A comparison of evaluation on a sample using PSDS and Th-AUC. PSDS measures the performance under the best threshold while Th-AUC measures the performance over all possible thresholds.
  • ...and 9 more figures
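
The sentence-level pipeline described in the Figure 2 caption can be illustrated with a small sketch. It assumes the frame-phrase similarities $\mathrm{s}_\mathrm{fp}$ are already available as a frames-by-phrases matrix of scores in [0, 1]. The function names, the toy values, the choice of `"mean"` and `"max"` pooling, and the 0.5 threshold are all illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple


def pool(values: List[float], mode: str) -> float:
    """Collapse a list of similarities into one score ('mean' or 'max')."""
    if mode == "mean":
        return sum(values) / len(values)
    if mode == "max":
        return max(values)
    raise ValueError(f"unknown pooling mode: {mode}")


def clip_sentence_similarity(
    s_fp: List[List[float]], audio_pool: str = "mean", text_pool: str = "mean"
) -> float:
    """Training path: turn frame-phrase scores s_fp[t][n] into one s_cs."""
    num_phrases = len(s_fp[0])
    # Audio pooling: collapse the time axis, giving one score per phrase.
    per_phrase = [
        pool([frame[n] for frame in s_fp], audio_pool) for n in range(num_phrases)
    ]
    # Text pooling: collapse the phrase axis, giving the clip-sentence score.
    return pool(per_phrase, text_pool)


def frames_to_segments(
    scores: List[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Inference path: threshold per-frame scores of one phrase into
    (onset, offset) frame indices, with the offset exclusive."""
    segments = []
    start = None
    for t, score in enumerate(scores):
        if score > threshold and start is None:
            start = t  # event begins
        elif score <= threshold and start is not None:
            segments.append((start, t))  # event ends
            start = None
    if start is not None:
        segments.append((start, len(scores)))
    return segments


# Toy example (made-up values): 4 audio frames, 2 phrases from one caption.
s_fp = [
    [0.1, 0.8],
    [0.9, 0.7],
    [0.8, 0.2],
    [0.2, 0.1],
]
s_cs_mean = clip_sentence_similarity(s_fp, audio_pool="mean", text_pool="mean")
s_cs_max = clip_sentence_similarity(s_fp, audio_pool="max", text_pool="mean")
segments_phrase_0 = frames_to_segments([frame[0] for frame in s_fp])
segments_phrase_1 = frames_to_segments([frame[1] for frame in s_fp])
```

In this toy case, mean pooling over time dilutes the two active frames of each phrase with the inactive ones, while max pooling keeps the peak score. This kind of difference is what a comparison of pooling strategies would look at.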