Full-frequency dynamic convolution: a physical frequency-dependent convolution for sound event detection
Haobo Yue, Zhicheng Zhang, Da Mu, Yonghao Dang, Jianqin Yin, Jin Tang
TL;DR
The paper argues that standard 2D convolution on spectrograms, often written $y = \boldsymbol{W} * x + \boldsymbol{b}$, is ill-suited to sound event detection (SED): it applies the same kernel at every frequency and therefore imposes translation equivariance along the frequency axis, which is not a shift-invariant dimension for audio. It introduces full-frequency dynamic convolution (FFDConv), which generates a separate kernel for each frequency band through a two-branch filter generator (spatial and channel) to model frequency dependence. On the DESED real validation set, FFDConv outperforms the baseline and other full-dynamic methods, achieving the best PSDS2 and IB-F1 and reducing parameters by about two-thirds relative to the strongest baselines, while also yielding temporally coherent, frequency-specific features. The work offers a physically grounded approach to SED that uses frequency-band specialization to improve discrimination among overlapping sound events.
Abstract
Recently, 2D convolution has been found ill-suited to sound event detection (SED). It enforces translation equivariance on sound events along the frequency axis, which is not a shift-invariant dimension. To address this issue, dynamic convolution has been used to model the frequency dependency of sound events. In this paper, we propose the first full-dynamic method, named full-frequency dynamic convolution (FFDConv). FFDConv generates frequency kernels for every frequency band, so frequency-dependent modeling is built directly into its structure. It physically furnishes 2D convolution with the capability of frequency-dependent modeling. FFDConv not only outperforms the baseline by 6.6% on the DESED real validation dataset in terms of PSDS1, but also outperforms the other full-dynamic methods. In addition, by visualizing features of sound events, we observe that FFDConv effectively extracts coherent features in specific frequency bands, consistent with the vocal continuity of sound events. This indicates that FFDConv has strong frequency-dependent perception ability.
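
The contrast between a shared kernel and per-frequency kernels can be illustrated with a minimal pure-Python sketch. It is not the FFDConv implementation: the helper names (`conv_at`, `shared_conv`, `per_frequency_conv`, `shift_up`) and the toy spectrogram are hypothetical, and the per-frequency kernels are fixed, hand-picked values standing in for whatever the filter generator would produce from the input. The sketch only shows that a shared kernel commutes with a shift along the frequency axis, whereas one kernel per frequency bin does not.

```python
# Illustrative sketch only; names and values are assumptions, not the paper's code.
# A spectrogram is a nested list indexed as spec[frequency][time].

def conv_at(spec, kernel, f, t):
    """Correlate a 3x3 kernel with the zero-padded patch centred on (f, t)."""
    n_freq, n_time = len(spec), len(spec[0])
    total = 0.0
    for df in (-1, 0, 1):
        for dt in (-1, 0, 1):
            ff, tt = f + df, t + dt
            if 0 <= ff < n_freq and 0 <= tt < n_time:
                total += kernel[df + 1][dt + 1] * spec[ff][tt]
    return total

def shared_conv(spec, kernel):
    """Standard 2D convolution: the same kernel at every frequency bin."""
    return [[conv_at(spec, kernel, f, t) for t in range(len(spec[0]))]
            for f in range(len(spec))]

def per_frequency_conv(spec, kernels):
    """Frequency-dependent convolution: kernels[f] is applied only at bin f."""
    return [[conv_at(spec, kernels[f], f, t) for t in range(len(spec[0]))]
            for f in range(len(spec))]

def shift_up(spec, k):
    """Move content k bins up the frequency axis, padding with zero rows."""
    n_time = len(spec[0])
    kept = [row[:] for row in spec[:len(spec) - k]]
    return [[0.0] * n_time for _ in range(k)] + kept

# Toy spectrogram: 6 frequency bins x 4 frames, one event in bin 1.
spec = [[0.0] * 4 for _ in range(6)]
spec[1] = [1.0, 2.0, 2.0, 1.0]

base = [[0.0, 0.5, 0.0],
        [0.5, 1.0, 0.5],
        [0.0, 0.5, 0.0]]

# Hand-picked stand-in for generated kernels: bin f scales the base kernel by f + 1.
kernels = [[[w * (f + 1) for w in row] for row in base] for f in range(6)]

# Does convolving a shifted input equal shifting the convolved output?
shared_equivariant = (
    shared_conv(shift_up(spec, 2), base) == shift_up(shared_conv(spec, base), 2)
)
per_freq_equivariant = (
    per_frequency_conv(shift_up(spec, 2), kernels)
    == shift_up(per_frequency_conv(spec, kernels), 2)
)

print(shared_equivariant)    # True: the shared kernel ignores absolute frequency
print(per_freq_equivariant)  # False: the response depends on which bin the event is in
```

In this toy setting, the same event produces a different response when it appears two bins higher, which is the frequency-dependent behaviour the abstract describes; how the kernels are actually generated from the input is left out of the sketch.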
