Frequency-aware convolution for sound event detection

Tao Song, WenWen Zhang

TL;DR

This paper proposes frequency-aware convolution (FAC), a more parameter-efficient alternative to frequency dynamic convolution (FDY): frequency positional information is encoded in a vector that is explicitly added to the input spectrogram.

Abstract

In sound event detection (SED), convolutional neural networks (CNNs) are widely employed to extract time-frequency (TF) patterns from spectrograms. However, the ability of CNNs to recognize different sound events is limited by their insensitivity to shifts of TF patterns along the frequency dimension, caused by translation equivariance. To address this issue, a model called frequency dynamic convolution (FDY) has been proposed, which involves applying specific convolution kernels to different frequency components. However, FDY requires a significantly larger number of parameters and computational resources compared to a standard CNN. This paper proposes a more efficient solution called frequency-aware convolution (FAC). FAC incorporates frequency positional information by encoding it in a vector, which is then explicitly added to the input spectrogram. To ensure that the amplitude of the encoding vector matches that of the input spectrogram, the encoding vector is adaptively and channel-dependently scaled using self-attention. To evaluate the effectiveness of FAC, we conducted experiments within the context of the DCASE 2023 task 4. The results show that FAC achieves comparable performance to FDY while requiring only an additional 515 parameters, whereas FDY necessitates an additional 8.02 million parameters. Furthermore, an ablation study confirms that the adaptive and channel-dependent scaling of the encoding vector is critical to the performance of FAC.
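
The core idea of FAC described above (a per-frequency encoding vector, scaled per channel according to the input, then added to the spectrogram) can be illustrated with a minimal sketch. The abstract does not give the self-attention formula, so the scaling used here (a per-channel affine function of the mean absolute amplitude) is a hypothetical stand-in, and the names `frequency_aware_input`, `weights`, and `biases` are assumptions made for illustration rather than the authors' implementation.

```python
def frequency_aware_input(x, encoding, weights, biases):
    """Add a channel-wise scaled frequency encoding to a spectrogram.

    x:        nested list with shape [channels][time][freq]
    encoding: list of length freq, one value per frequency bin
              (learned in the paper; fixed here for illustration)
    weights, biases: per-channel parameters of a HYPOTHETICAL scaling
              function that stands in for the paper's self-attention
    """
    out = []
    for c, channel in enumerate(x):
        n_values = sum(len(frame) for frame in channel)
        mean_abs = sum(abs(v) for frame in channel for v in frame) / n_values
        # Assumption: the scale depends on the input amplitude (adaptive)
        # and has its own parameters per channel (channel-dependent).
        scale = weights[c] * mean_abs + biases[c]
        out.append([[v + scale * e for v, e in zip(frame, encoding)]
                    for frame in channel])
    return out


# Two channels with very different amplitudes, 2 frames, 4 frequency bins.
x = [
    [[1.0] * 4, [1.0] * 4],
    [[10.0] * 4, [10.0] * 4],
]
encoding = [0.0, 1.0, 2.0, 3.0]
out = frequency_aware_input(x, encoding, weights=[1.0, 1.0], biases=[0.0, 0.0])
print(out[0][0])
print(out[1][0])
```

In this toy setting the encoding added to the second channel is ten times larger than the one added to the first, so it stays proportionate to each channel's amplitude. A fixed, unscaled encoding would be comparatively negligible in the louder channel.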

Paper Structure (9 sections, 7 equations, 5 figures, 1 table)

This paper contains 9 sections, 7 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Example of the insensitivity of a CNN module to shifts of time-frequency patterns. The input spectrogram has 64 frequency bins, only one of which has a nonzero value (1). Shifting the TF pattern by 10 frequency bins does not change the output of the CNN module.
  • Figure 2: Frequency-aware convolution.
  • Figure 3: Model performance gain as a function of the number of FAC layers.
  • Figure 4: Encoding vector learned by each FAC layer.
  • Figure 5: Performance improvements of FAC-CRNN compared to CRNN under three conditions: fixed, adapt, and adapt&indep.
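
The shift insensitivity described in the Figure 1 caption can be reproduced with a small sketch. The structure of the CNN module in the figure is not specified here, so the code assumes a single 1-D convolution along frequency, a ReLU, and global max pooling over frequency. The kernel values and bin positions are arbitrary choices for illustration.

```python
def conv1d_same(signal, kernel):
    """Zero-padded 1-D cross-correlation along the frequency axis.

    Assumes an odd kernel length so the output has the input's length.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]


def cnn_module(signal, kernel):
    """Convolution, ReLU, then global max pooling (assumed structure)."""
    return max(max(v, 0.0) for v in conv1d_same(signal, kernel))


def one_hot(n_bins, index):
    """Spectrogram frame with a single nonzero bin of value 1."""
    return [1.0 if i == index else 0.0 for i in range(n_bins)]


kernel = [0.25, 0.5, -0.25]   # arbitrary illustrative kernel
original = one_hot(64, 20)    # 64 frequency bins, energy in bin 20
shifted = one_hot(64, 30)     # same pattern moved up by 10 bins

feature_original = conv1d_same(original, kernel)
feature_shifted = conv1d_same(shifted, kernel)

# Equivariance: the feature map moves by the same 10 bins as the input.
equivariant = all(feature_shifted[i + 10] == feature_original[i]
                  for i in range(64 - 10))

# After pooling over frequency, the position is no longer recoverable.
print(equivariant, cnn_module(original, kernel), cnn_module(shifted, kernel))
```

Under these assumptions the convolution only relocates its response, and pooling over frequency discards where the response occurred, which is why both inputs yield the same output. Adding frequency positional information to the input, as FAC does, is one way to make the two cases distinguishable.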