Temporal Attention Pooling for Frequency Dynamic Convolution in Sound Event Detection

Hyeonuk Nam; Yong-Hwa Park

TL;DR

This work addresses the challenge of capturing transient sound events in SED by replacing temporal average pooling in frequency dynamic convolution (FDY conv) with Temporal Attention Pooling (TAP), forming Temporal Attention Pooling Frequency Dynamic Convolution (TFD conv). TAP integrates time and velocity attention with an auxiliary average pooling path to selectively weight temporal features, improving sensitivity to non-stationary events while preserving robustness for stationary signals. Ablation and cross-variant analyses demonstrate that TFD conv substantially boosts PSDS1, particularly for transient-heavy classes, and is compatible with FDY conv variants like DFD, PFD, and MDFD, with TAP + MDFD delivering a new state-of-the-art of 0.459 PSDS1 on DESED. The approach offers a generalizable, efficient path to enhance time-frequency feature extraction in SED, with practical implications for real-world audio event detection systems.

Abstract

Recent advances in deep learning, particularly frequency dynamic convolution (FDY conv), have significantly improved sound event detection (SED) by enabling frequency-adaptive feature extraction. However, FDY conv relies on temporal average pooling, which treats all temporal frames equally, limiting its ability to capture transient sound events such as alarm bells, door knocks, and speech plosives. To address this limitation, we propose temporal attention pooling frequency dynamic convolution (TFD conv), which replaces temporal average pooling with temporal attention pooling (TAP). TAP adaptively weights temporal features through three complementary mechanisms: time attention pooling (TA) for emphasizing salient features, velocity attention pooling (VA) for capturing transient changes, and conventional average pooling for robustness to stationary signals. Ablation studies show that TFD conv improves average PSDS1 by 3.02% over FDY conv with only a 14.8% increase in parameter count. Classwise ANOVA and Tukey HSD analysis further demonstrate that TFD conv significantly enhances detection performance for transient-heavy events, outperforming existing FDY conv models. Notably, TFD conv achieves a maximum PSDS1 score of 0.456, surpassing previous state-of-the-art SED systems. We also explore the compatibility of TAP with other FDY conv variants, including dilated FDY conv (DFD conv), partial FDY conv (PFD conv), and multi-dilated FDY conv (MDFD conv). Among these, the integration of TAP with MDFD conv achieves the best result with a PSDS1 score of 0.459, validating the complementary strengths of temporal attention and multi-scale frequency adaptation. These findings establish TFD conv as a powerful and generalizable framework for enhancing both transient sensitivity and overall feature robustness in SED.
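To make the three pooling paths concrete, the following is a minimal sketch on a single one-dimensional feature sequence, not the implementation used in this work. It rests on several assumptions: attention scores are taken directly from the frame values (TA) or from the absolute frame-to-frame difference (VA) rather than from learned layers, the `scale` parameter and all function names are hypothetical, and the three pooled values are returned side by side because how they are fused is not specified here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_pool(frames):
    """Conventional temporal average pooling: every frame weighs the same."""
    return sum(frames) / len(frames)


def time_attention_pool(frames, scale=1.0):
    """Weights frames by a softmax over their own values.

    Assumption: a trained model would learn the scoring function;
    here the score is simply scale * frame value.
    """
    weights = softmax([scale * f for f in frames])
    return sum(w * f for w, f in zip(weights, frames))


def velocity_attention_pool(frames, scale=1.0):
    """Weights frames by how quickly the signal changes around them.

    Assumption: velocity is the first-order difference between
    consecutive frames, with the first frame assigned zero velocity.
    """
    velocity = [0.0] + [b - a for a, b in zip(frames, frames[1:])]
    weights = softmax([scale * abs(v) for v in velocity])
    return sum(w * f for w, f in zip(weights, frames))


def temporal_attention_pool(frames):
    """Returns the (average, time-attention, velocity-attention) pooled values."""
    return (
        average_pool(frames),
        time_attention_pool(frames),
        velocity_attention_pool(frames),
    )


# A stationary signal and a signal with one short transient.
stationary = [1.0, 1.0, 1.0, 1.0]
transient = [0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0]

stationary_pooled = temporal_attention_pool(stationary)
transient_pooled = temporal_attention_pool(transient)
```

On the stationary sequence all three paths agree, while on the transient sequence the two attention paths give the short burst more weight than plain averaging does, which is the behaviour the abstract attributes to TAP.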
