Hierarchical Pooling Structure for Weakly Labeled Sound Event Detection
Ke-Xin He, Yu-Han Shen, Wei-Qiang Zhang
TL;DR
This work tackles weakly labeled sound event detection by formulating it as a multi-instance learning problem and identifies pooling strategy as a key performance lever. It introduces a three-stage hierarchical pooling structure that aggregates frame-level predictions into segment- and clip-level decisions, providing stronger supervision without adding model parameters. Empirical results on DCASE 2017 Task 4 show consistent improvements across pooling functions, with linear softmax often yielding the best gains, and competitive performance relative to state-of-the-art methods without ensemble techniques. The approach offers a generalizable MIL-oriented mechanism to enhance learning from weak labels in audio and related domains.
Abstract
Sound event detection with weakly labeled data can be treated as a multi-instance learning problem, in which the choice of pooling function is key. In this paper, we propose a hierarchical pooling structure to improve the performance of weakly labeled sound event detection systems. The proposed structure yields notable improvements for three types of pooling function without adding any parameters. Moreover, using the hierarchical pooling structure, our system achieves competitive performance on Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 Challenge.
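
To make the idea of parameter-free hierarchical pooling concrete, the following is a minimal sketch and not the authors' implementation. It assumes frame-level probabilities of shape (frames, classes), fixed-length non-overlapping segments, and the same pooling function at both levels; the function names, the `segment_len` parameter, and the way segments are formed are hypothetical choices made for illustration. Linear softmax pooling is written in its commonly used form, sum(p^2) / sum(p); max pooling is included only as a second example.

```python
import numpy as np


def max_pool(probs, axis):
    """Max pooling: the bag probability is the largest instance probability."""
    return probs.max(axis=axis)


def linear_softmax_pool(probs, axis, eps=1e-8):
    """Linear softmax pooling: each probability is weighted by itself."""
    return (probs ** 2).sum(axis=axis) / (probs.sum(axis=axis) + eps)


def hierarchical_pool(frame_probs, pool_fn, segment_len):
    """Pool frames into segments, then segments into one clip-level vector.

    frame_probs: array of shape (T, C) with values in [0, 1].
    segment_len: assumed fixed segment length; T must be divisible by it.
    Returns (segment_probs of shape (T // segment_len, C), clip_probs of shape (C,)).
    """
    num_frames, num_classes = frame_probs.shape
    if num_frames % segment_len != 0:
        raise ValueError("number of frames must be divisible by segment_len")
    # Group consecutive frames into non-overlapping segments.
    segments = frame_probs.reshape(num_frames // segment_len, segment_len, num_classes)
    segment_probs = pool_fn(segments, axis=1)    # frame level -> segment level
    clip_probs = pool_fn(segment_probs, axis=0)  # segment level -> clip level
    return segment_probs, clip_probs


# Toy input: one class, four frames, a single confident frame.
frame_probs = np.array([[0.9], [0.1], [0.1], [0.1]])
segment_probs, clip_probs = hierarchical_pool(frame_probs, linear_softmax_pool, segment_len=2)
flat_clip_probs = linear_softmax_pool(frame_probs, axis=0)
```

In this toy input the hierarchical clip-level score differs from the one obtained by pooling all frames at once, and the intermediate `segment_probs` give an additional level at which predictions can be inspected. No trainable parameters are involved at any stage.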
