Hierarchical Pooling Structure for Weakly Labeled Sound Event Detection

Ke-Xin He, Yu-Han Shen, Wei-Qiang Zhang

TL;DR

This work tackles weakly labeled sound event detection by formulating it as a multi-instance learning problem and identifies pooling strategy as a key performance lever. It introduces a three-stage hierarchical pooling structure that aggregates frame-level predictions into segment- and clip-level decisions, providing stronger supervision without adding model parameters. Empirical results on DCASE 2017 Task 4 show consistent improvements across pooling functions, with linear softmax often yielding the best gains, and competitive performance relative to state-of-the-art methods without ensemble techniques. The approach offers a generalizable MIL-oriented mechanism to enhance learning from weak labels in audio and related domains.

Abstract

Sound event detection with weakly labeled data is treated as a multi-instance learning problem, and the choice of pooling function is key to solving it. In this paper, we propose a hierarchical pooling structure to improve the performance of weakly labeled sound event detection systems. The proposed pooling structure yields remarkable improvements for three types of pooling function without adding any parameters. Moreover, using the hierarchical pooling structure, our system achieves competitive performance on Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 Challenge.

Paper Structure

This paper contains 16 sections, 34 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Illustration of Multi-Instance Learning System for sound event detection with weakly labeled data.
  • Figure 2: Overview of baseline system.
  • Figure 3: Architecture of neural networks. The first and second dimensions of convolutional kernels and strides represent the time axis and frequency axis respectively. The size of all convolutional kernels is $3\times 3$.
  • Figure 4: Three-stage hierarchical pooling structure. In linear and exponential softmax pooling, the frame-level weights $w_i$ are derived from the frame-level predictions $x_i$; in attention pooling, they are learnt from the output of the Bi-GRU. In the first stage, every five frames are aggregated to get segment-level predictions $\hat{x}_{j}$, and the weights of every five frames are averaged to get segment-level weights $\hat{w}_{j}$. In the second stage, every five segments are aggregated to get longer-segment-level predictions $\widetilde{x}_k$, and every five segment-level weights are averaged to get longer-segment-level weights $\widetilde{w}_k$. Finally, $\widetilde{x}_k$ and $\widetilde{w}_k$ are aggregated to get the final clip-level prediction.
  • Figure 5: The frame-level predictions of three systems on an evaluation audio clip.
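
The three-stage structure described in the Figure 4 caption can be sketched in a few lines of NumPy. This is a minimal illustration for a single event class, not the authors' implementation. It assumes that "aggregated" means a weighted average of predictions within each group, and that the weights follow the commonly used definitions $w_i = x_i$ for linear softmax and $w_i = \exp(x_i)$ for exponential softmax. The names `pool_groups`, `hierarchical_pool`, and `EPS`, and the choice to drop any frames left over after grouping, are hypothetical.

```python
import numpy as np

EPS = 1e-8  # guards against division by zero; an assumption, not from the paper


def pool_groups(x, w, group=5):
    """Aggregate every `group` consecutive items.

    Predictions are combined by a weighted average (assumed form of
    "aggregated"); weights are combined by a plain average, as in the caption.
    """
    n = (len(x) // group) * group          # drop a ragged tail (assumption)
    xg = x[:n].reshape(-1, group)
    wg = w[:n].reshape(-1, group)
    pooled_x = (wg * xg).sum(axis=1) / (wg.sum(axis=1) + EPS)
    pooled_w = wg.mean(axis=1)
    return pooled_x, pooled_w


def hierarchical_pool(x, w, group=5):
    """Frames -> segments -> longer segments -> one clip-level prediction."""
    x_seg, w_seg = pool_groups(x, w, group)            # stage 1
    x_long, w_long = pool_groups(x_seg, w_seg, group)  # stage 2
    # Stage 3: aggregate the longer segments into the clip-level prediction.
    return float((w_long * x_long).sum() / (w_long.sum() + EPS))


# Toy frame-level probabilities for one class: 50 frames -> 10 -> 2 -> 1.
frame_probs = np.linspace(0.1, 0.9, 50)

clip_linear = hierarchical_pool(frame_probs, frame_probs)          # w_i = x_i
clip_exp = hierarchical_pool(frame_probs, np.exp(frame_probs))     # w_i = exp(x_i)
clip_uniform = hierarchical_pool(frame_probs, np.ones_like(frame_probs))
```

With uniform weights the hierarchy reduces to a plain mean over the frames, which makes a convenient sanity check. With prediction-dependent weights, frames with higher predicted probability contribute more to the clip-level output. Attention pooling would fit the same functions by passing learned weights as `w` instead of weights computed from `x`.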