Joint Analysis of Acoustic Scenes and Sound Events Based on Semi-Supervised Training of Sound Events With Partial Labels

Keisuke Imoto

TL;DR

This paper tackles the high cost of annotating time boundaries for sound events by proposing a multitask framework that jointly analyzes acoustic scenes and sound events using partial labels generated from scene context. It introduces a semi-supervised training regime that combines strong labels with partial labels and employs a self-distillation-based label refinement to improve learning. The approach leverages LLM-generated partial labels to reduce labeling effort and demonstrates that ASC and SED can achieve competitive performance under reduced supervision on the TUT ASC/SED datasets. The work offers a practical pathway to scalable environmental sound analysis by balancing annotation cost with detection accuracy and outlines directions for improving partial-label quality and extending to single-task SED.

Abstract

Annotating the time boundaries of sound events is labor-intensive, which limits the scalability of strongly supervised learning in audio detection. To reduce annotation costs, weakly supervised learning with only clip-level labels has been widely adopted. As an alternative, partial label learning offers a cost-effective approach in which a set of possible labels is provided instead of exact weak annotations. However, partial label learning for audio analysis remains largely unexplored. Motivated by the observation that acoustic scenes provide contextual information for constructing a set of possible sound events, we utilize acoustic scene information to construct partial labels of sound events. On the basis of this idea, we propose a multitask learning framework that jointly performs acoustic scene classification and sound event detection with partial labels of sound events. Although they reduce annotation costs, weakly supervised and partial label learning often suffer from lower detection performance because they lack the precise event set and its temporal annotations. To better balance annotation cost and detection performance, we also explore a semi-supervised framework that leverages both strong and partial labels. Moreover, to refine partial labels and improve model training, we propose a label refinement method based on self-distillation for the proposed approach with partial labels.
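As a rough illustration of the idea of deriving partial labels from scene context, the sketch below maps an acoustic scene to a candidate set of sound events and encodes it as a multi-hot vector. The scene names, event names, and the helpers `partial_label` and `is_consistent` are hypothetical and chosen only for illustration; they are not taken from the paper or its datasets.

```python
# Hypothetical scene-to-event candidate map (illustrative names only,
# not the label set used in the paper).
SCENE_TO_EVENTS = {
    "home": {"dishes", "cutlery", "people talking"},
    "residential_area": {"car", "bird singing", "people walking"},
}

# Fixed, sorted event vocabulary so that vector positions are stable.
EVENT_CLASSES = sorted(set().union(*SCENE_TO_EVENTS.values()))


def partial_label(scene):
    """Return a multi-hot candidate vector for a clip recorded in `scene`.

    A 1 means the event *may* occur in the clip; a 0 means the event is
    assumed not to occur. No timing information is encoded.
    """
    candidates = SCENE_TO_EVENTS[scene]
    return [1 if event in candidates else 0 for event in EVENT_CLASSES]


def is_consistent(active_events, scene):
    """Check that every event actually present lies in the candidate set."""
    return set(active_events) <= SCENE_TO_EVENTS[scene]


if __name__ == "__main__":
    print(EVENT_CLASSES)
    print(partial_label("home"))
    print(is_consistent({"dishes"}, "home"))
```

A partial label of this kind is weaker than a weak label: it only narrows the set of events that might be present, which is why the abstract pairs it with strongly labeled data and label refinement.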

Paper Structure

This paper contains 17 sections, 8 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Illustration comparing strong, weak, and partial labels for sound events. Strong labels provide sound event classes and their time stamps, weak labels indicate which event classes occur within an audio clip, and partial labels provide a candidate set of event labels.
  • Figure 2:
  • Figure 4: Self-distillation-based model training for semi-supervised method using partial labels of sound events
  • Figure 5: ASC performance for various ratios of weakly/partially labeled data of sound events in terms of micro-Fscore
  • Figure 6: SED performance for various ratios of weakly/partially labeled data of sound events in terms of micro-Fscore
  • ...and 5 more figures