Joint Analysis of Acoustic Scenes and Sound Events Based on Semi-Supervised Training of Sound Events With Partial Labels
Keisuke Imoto
TL;DR
This paper tackles the high cost of annotating time boundaries for sound events by proposing a multitask framework that jointly analyzes acoustic scenes and sound events using partial labels generated from scene context. It introduces a semi-supervised training regime that combines strong labels with partial labels and employs a self-distillation-based label refinement to improve learning. The approach leverages LLM-generated partial labels to reduce labeling effort and demonstrates that ASC and SED can achieve competitive performance under reduced supervision on the TUT ASC/SED datasets. The work offers a practical pathway to scalable environmental sound analysis by balancing annotation cost with detection accuracy and outlines directions for improving partial-label quality and extending to single-task SED.
Abstract
Annotating the time boundaries of sound events is labor-intensive, which limits the scalability of strongly supervised learning in audio detection. To reduce annotation costs, weakly supervised learning with only clip-level labels has been widely adopted. As an alternative, partial label learning offers a cost-effective approach in which a set of possible labels is provided instead of exact weak annotations. However, partial label learning for audio analysis remains largely unexplored. Motivated by the observation that acoustic scenes provide contextual information for constructing a set of possible sound events, we utilize acoustic scene information to construct partial labels of sound events. On the basis of this idea, we propose a multitask learning framework that jointly performs acoustic scene classification and sound event detection with partial labels of sound events. Although they reduce annotation costs, weakly supervised and partial label learning often suffer from decreased detection performance because they lack the precise set of events and the temporal annotations of those events. To better balance annotation cost and detection performance, we also explore a semi-supervised framework that leverages both strong and partial labels. Moreover, to refine the partial labels and improve model training, we propose a label refinement method based on self-distillation for the proposed approach with partial labels.
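To make the idea of scene-derived partial labels more concrete, the following is a minimal illustrative sketch and not the paper's implementation. It assumes a hypothetical event vocabulary and scene-to-event mapping (`EVENTS`, `SCENE_CANDIDATES`), a simple loss that penalizes predicted activity only for events outside the candidate set, and a simple refinement rule that keeps a candidate only when a teacher model's clip-level maximum probability passes a threshold. All names, the loss form, and the threshold are assumptions made for illustration.

```python
import numpy as np

# Hypothetical event vocabulary and scene-to-event candidates (illustrative only).
EVENTS = ["car", "bird singing", "dishes", "keyboard typing"]
SCENE_CANDIDATES = {
    "street": {"car", "bird singing"},
    "home": {"dishes", "keyboard typing"},
}


def partial_label(scene):
    """Return a 0/1 candidate mask over EVENTS for the given scene."""
    candidates = SCENE_CANDIDATES[scene]
    return np.array([1.0 if e in candidates else 0.0 for e in EVENTS])


def partial_label_loss(probs, mask, eps=1e-7):
    """Penalize predicted activity only for events outside the candidate set.

    probs: (frames, events) predicted event probabilities.
    mask:  (events,) candidate mask; 1 = event may occur, 0 = assumed absent.
    Candidate events contribute no loss here because their activity is unknown.
    """
    probs = np.clip(probs, eps, 1.0 - eps)
    negative_bce = -np.log(1.0 - probs)  # binary cross-entropy against target 0
    non_candidate = 1.0 - mask           # broadcasts over the frame axis
    denom = max(non_candidate.sum() * probs.shape[0], 1.0)
    return float((negative_bce * non_candidate).sum() / denom)


def refine_partial_label(mask, teacher_probs, threshold=0.5):
    """Assumed refinement rule: drop candidates the teacher never predicts.

    teacher_probs: (frames, events) probabilities from a previously trained model.
    """
    clip_level = teacher_probs.max(axis=0)  # clip-level score per event
    return mask * (clip_level >= threshold)


mask = partial_label("street")
loss = partial_label_loss(np.full((3, 4), 0.5), mask)
teacher = np.array([[0.9, 0.1, 0.8, 0.2],
                    [0.7, 0.2, 0.9, 0.1]])
refined = refine_partial_label(mask, teacher)
```

In a semi-supervised setting of the kind described above, clips with strong labels would use an ordinary frame-level loss against their annotated targets, while clips with only partial labels would fall back to a candidate-set loss such as this one.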
