Full-frequency dynamic convolution: a physical frequency-dependent convolution for sound event detection

Haobo Yue, Zhicheng Zhang, Da Mu, Yonghao Dang, Jianqin Yin, Jin Tang

TL;DR

For sound event detection (SED), the paper argues that standard 2D convolution on spectrograms, often written $y = \boldsymbol{W} * x + \boldsymbol{b}$, applies the same kernel at every frequency and so treats the frequency axis as shift-invariant, which it is not for audio signals. It introduces full-frequency dynamic convolution (FFDConv), which generates per-frequency kernels through a two-branch filter generator (spatial and channel) to realize frequency-dependent modeling. On the DESED real validation set, FFDConv outperforms the baseline and other full-dynamic methods, achieving the best PSDS2 and IB-F1 and reducing parameters by about two-thirds relative to the strongest baselines, while also yielding temporally coherent, frequency-specific features. This work provides a physically grounded approach to SED that leverages frequency-band specialization to improve discrimination among overlapping sound events.

Abstract

Recently, 2D convolution has been found ill-suited to sound event detection (SED). It enforces translation equivariance on sound events along the frequency axis, which is not a shift-invariant dimension. To address this issue, dynamic convolution is used to model the frequency dependency of sound events. In this paper, we propose the first full-dynamic method, named full-frequency dynamic convolution (FFDConv). FFDConv generates frequency kernels for every frequency band, so frequency-dependent modeling is designed directly into its structure. It physically furnishes 2D convolution with the capability of frequency-dependent modeling. FFDConv not only outperforms the baseline by 6.6% in PSDS1 on the DESED real validation dataset, but also outperforms the other full-dynamic methods. In addition, by visualizing features of sound events, we observed that FFDConv could effectively extract coherent features in specific frequency bands, consistent with the vocal continuity of sound events. This proves that FFDConv has great frequency-dependent perception ability.
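To make the contrast in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It compares a temporal kernel shared across all frequency bands with one kernel per frequency band. The function names, the single-channel `(F, T)` input layout, and the zero padding are assumptions made for illustration. As in most deep-learning code, "convolution" here means cross-correlation.

```python
import numpy as np


def per_frequency_conv(x, kernels):
    """Filter each frequency band along time with its own kernel.

    x:       (F, T) spectrogram-like array (F bands, T frames).
    kernels: (F, k) array, one temporal kernel per band, k odd.
    Returns an (F, T) array (zero padding keeps the time length).
    """
    F, T = x.shape
    k = kernels.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))  # pad the time axis only
    out = np.zeros((F, T), dtype=float)
    for j in range(k):
        # Tap j of every band's kernel multiplies the input shifted by j frames.
        out += kernels[:, j:j + 1] * xp[:, j:j + T]
    return out


def shared_kernel_conv(x, kernel):
    """Same operation, but every band reuses one (k,) kernel."""
    return per_frequency_conv(x, np.tile(kernel, (x.shape[0], 1)))


# Hypothetical toy input: 3 frequency bands, 4 time frames.
x = np.arange(12, dtype=float).reshape(3, 4)
identity = np.array([0.0, 1.0, 0.0])

kernels = np.tile(identity, (3, 1))
kernels[1] = 0.0  # band 1 gets a different kernel from bands 0 and 2
y = per_frequency_conv(x, kernels)
```

With the shared kernel every band is filtered identically, whereas the per-band version lets band 1 be suppressed while bands 0 and 2 pass through unchanged. In FFDConv the per-band kernels are generated from the input rather than fixed by hand as they are here.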

Paper Structure (14 sections, 2 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Illustration of frequency-dependent modeling. Top: time-frequency patterns are modeled in the same space with a shared kernel. Bottom: they are modeled in several spaces with frequency-adaptive kernels, so that time-frequency patterns specific to sound events can be considered.
  • Figure 2: Illustration of full-frequency dynamic convolution. In general, the factory produces frequency-dependent kernels from the acoustic features, and the kernels are then convolved with the input along the time axis. In the factory, two workshops produce spatial filters and channel filters, respectively, which are then integrated in the assembly workshop.
  • Figure 3: Details of FFDConv.
  • Figure 4: Feature comparison of FFDConv and CRNN. Feature activations of the 5th Conv block are shown in the 4th row. The trends of frequency band features over time are shown in the 5th row. Note that the y-axis labels of the strong predictions are abbreviations of the sound event categories. For example, Abr stands for Alarm bell ringing.
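
The "factory" in the Figure 2 caption can be pictured with a small NumPy sketch. The caption does not say how the two workshops compute their filters or how the assembly workshop combines them, so everything below is an assumption made for illustration: the pooling choices, the `tanh` gate, the broadcast product used for assembly, and the names `generate_kernels`, `w_spatial` and `w_channel`. The random weights stand in for learned parameters.

```python
import numpy as np


def generate_kernels(x, w_spatial, w_channel):
    """Produce one temporal kernel per (channel, frequency band) from the input.

    x:         (C, F, T) feature map.
    w_spatial: (F, k) hypothetical weights of the spatial branch.
    w_channel: (C, F) hypothetical weights of the channel branch.
    Returns kernels of shape (C, F, k).
    """
    # Spatial branch (assumed): summarize each band over channels and time,
    # then scale that band's k taps by the summary.
    band_summary = x.mean(axis=(0, 2))               # (F,)
    spatial = band_summary[:, None] * w_spatial      # (F, k)

    # Channel branch (assumed): summarize over time, one gain per (channel, band).
    chan_summary = x.mean(axis=2)                    # (C, F)
    channel = np.tanh(chan_summary * w_channel)      # (C, F)

    # Assembly (assumed): broadcast product of the two branches.
    return channel[:, :, None] * spatial[None, :, :]  # (C, F, k)


# Hypothetical sizes: 2 channels, 4 frequency bands, 6 frames, 3 taps.
rng = np.random.default_rng(0)
C, F, T, k = 2, 4, 6, 3
x = rng.standard_normal((C, F, T))
w_spatial = rng.standard_normal((F, k))
w_channel = rng.standard_normal((C, F))

kernels = generate_kernels(x, w_spatial, w_channel)
```

The point of the sketch is only the data flow: the kernels are a function of the input, and each frequency band receives its own set of taps, which would then be applied along the time axis.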