Diversifying and Expanding Frequency-Adaptive Convolution Kernels for Sound Event Detection

Hyeonuk Nam, Seong-Hu Kim, Deokki Min, Junhyeok Lee, Yong-Hwa Park

TL;DR

FDY conv achieves state-of-the-art SED performance but may be limited by low diversity among its basis kernels. This work introduces dilated frequency dynamic convolution (DFD conv), which applies different dilation sizes to the basis kernels to diversify them and expand the spectral receptive field. On the DESED dataset, DFD conv improves SED performance: combined with class-wise median filtering, the best configuration surpasses the FDY baseline by 3.12% PSDS. An analysis of attention-weight variance supports the claim that the dilated basis kernels are more diverse. The results show that frequency-axis dilation is beneficial and that diversified kernels can improve SED without increasing model size, offering a scalable approach to frequency-adaptive convolution in polyphonic audio tasks.

Abstract

Frequency dynamic convolution (FDY conv) has shown state-of-the-art performance in sound event detection (SED) using frequency-adaptive kernels obtained by a frequency-varying combination of basis kernels. However, FDY conv lacks an explicit means to diversify frequency-adaptive kernels, potentially limiting performance. In addition, the size of the basis kernels is limited, while time-frequency patterns span a larger spectro-temporal range. Therefore, we propose dilated frequency dynamic convolution (DFD conv), which diversifies and expands frequency-adaptive kernels by introducing different dilation sizes to the basis kernels. Experiments showed the advantages of varying dilation sizes along the frequency dimension, and analysis of attention weight variance proved that dilated basis kernels are effectively diversified. By adapting a class-wise median filter with the intersection-based F1 score, the proposed DFD-CRNN outperforms FDY-CRNN by 3.12% in terms of polyphonic sound detection score (PSDS).
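
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes several simplifying assumptions: a single channel and a single time frame (so the convolution runs along frequency only), zero padding at the spectrum edges, and attention weights obtained by a softmax over kernels at each frequency bin. In the actual model the attention logits would be produced by a small network from the input; here they are passed in directly as hypothetical values. All function names are illustrative.

```python
import math


def dilated_conv1d(x, w, b, d):
    """Zero-padded 1D cross-correlation along frequency with dilation d."""
    k = len(w)
    center = k // 2
    out = []
    for f in range(len(x)):
        acc = b
        for j in range(k):
            idx = f + (j - center) * d  # taps are spaced d bins apart
            if 0 <= idx < len(x):       # outside the spectrum counts as zero
                acc += w[j] * x[idx]
        out.append(acc)
    return out


def softmax_over_kernels(logits):
    """Map logits[i][f] to pi[i][f], normalised over kernels i for each bin f."""
    num_kernels, num_bins = len(logits), len(logits[0])
    pi = [[0.0] * num_bins for _ in range(num_kernels)]
    for f in range(num_bins):
        m = max(logits[i][f] for i in range(num_kernels))
        exps = [math.exp(logits[i][f] - m) for i in range(num_kernels)]
        total = sum(exps)
        for i in range(num_kernels):
            pi[i][f] = exps[i] / total
    return pi


def dfd_conv1d(x, weights, biases, dilations, logits):
    """Frequency-adaptive mix of basis kernels that differ in dilation size.

    Assumed form: y[f] = sum_i pi_i[f] * (dilated_conv(x, W_i, d_i)[f] + b_i).
    """
    pi = softmax_over_kernels(logits)
    outs = [dilated_conv1d(x, w, b, d)
            for w, b, d in zip(weights, biases, dilations)]
    return [sum(pi[i][f] * outs[i][f] for i in range(len(outs)))
            for f in range(len(x))]


def effective_span(kernel_size, dilation):
    """Number of frequency bins covered by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


# Usage: two size-3 basis kernels with dilation 1 and 2, uniform attention.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = dfd_conv1d(
    x,
    weights=[[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
    biases=[0.0, 0.0],
    dilations=[1, 2],
    logits=[[0.0] * 5, [0.0] * 5],
)
print(y, effective_span(3, 1), effective_span(3, 2))
```

Even with identical weights, the two basis kernels in the usage example produce different outputs because their taps sample different frequency bins, and the dilated kernel covers 5 bins instead of 3 with the same number of parameters. This is the sense in which dilation can both diversify and expand the kernels without increasing model size.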

Paper Structure (13 sections, 2 equations, 2 figures, 4 tables)

Figures (2)

  • Figure 1: An illustration of the dilated frequency dynamic convolution (DFD conv) operation. $x$ and $y$ are the input and output of DFD conv, and $K$ is the number of basis kernels. $W_i$, $b_i$, $d_i$, and $\pi_i$ are the weight, bias, dilation size, and frequency-adaptive attention weight of the $i$-th basis kernel.
  • Figure 2: Plots comparing the variance of attention weights in the 2nd to 6th convolution layers of FDY-CRNN and DFD-CRNN.