Temporal Attention Pooling for Frequency Dynamic Convolution in Sound Event Detection

Hyeonuk Nam, Yong-Hwa Park

TL;DR

This work addresses the challenge of capturing transient sound events in SED by replacing temporal average pooling in frequency dynamic convolution (FDY conv) with Temporal Attention Pooling (TAP), forming Temporal Attention Pooling Frequency Dynamic Convolution (TFD conv). TAP integrates time and velocity attention with an auxiliary average pooling path to selectively weight temporal features, improving sensitivity to non-stationary events while preserving robustness for stationary signals. Ablation and cross-variant analyses demonstrate that TFD conv substantially boosts PSDS1, particularly for transient-heavy classes, and is compatible with FDY conv variants like DFD, PFD, and MDFD, with TAP + MDFD delivering a new state-of-the-art of 0.459 PSDS1 on DESED. The approach offers a generalizable, efficient path to enhance time-frequency feature extraction in SED, with practical implications for real-world audio event detection systems.

Abstract

Recent advances in deep learning, particularly frequency dynamic convolution (FDY conv), have significantly improved sound event detection (SED) by enabling frequency-adaptive feature extraction. However, FDY conv relies on temporal average pooling, which treats all temporal frames equally, limiting its ability to capture transient sound events such as alarm bells, door knocks, and speech plosives. To address this limitation, we propose temporal attention pooling frequency dynamic convolution (TFD conv) to replace temporal average pooling with temporal attention pooling (TAP). TAP adaptively weights temporal features through three complementary mechanisms: time attention pooling (TA) for emphasizing salient features, velocity attention pooling (VA) for capturing transient changes, and conventional average pooling for robustness to stationary signals. Ablation studies show that TFD conv improves average PSDS1 by 3.02% over FDY conv with only a 14.8% increase in parameter count. Classwise ANOVA and Tukey HSD analysis further demonstrate that TFD conv significantly enhances detection performance for transient-heavy events, outperforming existing FDY conv models. Notably, TFD conv achieves a maximum PSDS1 score of 0.456, surpassing previous state-of-the-art SED systems. We also explore the compatibility of TAP with other FDY conv variants, including dilated FDY conv (DFD conv), partial FDY conv (PFD conv), and multi-dilated FDY conv (MDFD conv). Among these, the integration of TAP with MDFD conv achieves the best result with a PSDS1 score of 0.459, validating the complementary strengths of temporal attention and multi-scale frequency adaptation. These findings establish TFD conv as a powerful and generalizable framework for enhancing both transient sensitivity and overall feature robustness in SED.

Paper Structure

This paper contains 31 sections, 11 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Overview of the proposed temporal attention pooling frequency dynamic convolution (TFD conv). The left side illustrates the overall architecture of the TFD conv-based SED model, where TFD conv layers replace standard FDY conv layers for enhanced time-frequency adaptive feature extraction. The right side provides a detailed breakdown of the temporal attention pooling (TAP) mechanism, which replaces temporal average pooling in FDY conv. TAP consists of three pooling components: (a) time attention pooling (TA), which dynamically weights salient temporal regions, (b) velocity attention pooling (VA), which applies attention based on temporal differences to emphasize transient events, and (c) average pooling to maintain robustness for stationary sound events. By integrating TAP with frequency-adaptive convolution kernels, TFD conv improves the recognition of transient and quasi-stationary sound events.
  • Figure 2: Illustration of different frequency dynamic convolution (FDY conv) variants. (a) FDY conv: Introduces frequency-adaptive convolution kernels to relax the translational equivariance of conventional 2D convolution [FDY]. (b) DFD conv: Incorporates dilated basis kernels to expand the spectral receptive field and diversify frequency-adaptive kernels [DFD]. (c) PFD conv: Introduces a static branch alongside the FDY conv dynamic branch to reduce model complexity [PFD]. (d) MDFD conv: Extends DFD and PFD by integrating multiple dilated dynamic branches within a single static branch for improved feature extraction [PFD].
  • Figure 3: Overview of the proposed temporal attention pooling (TAP) mechanism. TAP consists of three pooling branches: (a) Time attention pooling applies softmax-based attention to highlight salient temporal features. (b) Velocity attention pooling incorporates temporal differences ($\Delta x$) and applies softmax weighting to emphasize transient sound patterns. (c) Average pooling captures global temporal context by computing the mean over time. The outputs of these three pooling operations are summed to obtain the final TAP feature.
  • Figure 4: Classwise F1 score distribution of different models across ten sound event classes.
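
As a rough illustration of the three-branch pooling described in the Figure 3 caption, the following pure-Python sketch pools a single feature trajectory over time. It is not the authors' implementation. The scalar score weights `w_time` and `w_velocity`, the zero-padded first difference, and the choice to apply the velocity-derived weights to the original features `x` are all assumptions made for this example; a real model would learn its attention scores over full feature maps. Only the softmax weighting, the use of $\Delta x$, the average branch, and the final summation are taken from the caption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_difference(x):
    """First-order difference along time.

    Assumption: the first frame gets a zero difference so that the
    output has the same length as the input.
    """
    return [0.0] + [x[t] - x[t - 1] for t in range(1, len(x))]


def attention_pool(values, scores):
    """Weighted sum of `values`, with weights given by softmax(scores)."""
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values))


def tap_branches(x, w_time=1.0, w_velocity=1.0):
    """Return the (time attention, velocity attention, average) outputs.

    `x` holds the feature values over time for one position.
    `w_time` and `w_velocity` are hypothetical stand-ins for learned
    attention parameters.
    """
    dx = temporal_difference(x)
    # (a) Time attention: scores derived from the features themselves.
    ta = attention_pool(x, [w_time * v for v in x])
    # (b) Velocity attention: scores derived from temporal differences,
    #     applied here to the original features (an assumption).
    va = attention_pool(x, [w_velocity * d for d in dx])
    # (c) Average pooling: plain mean over time.
    avg = sum(x) / len(x)
    return ta, va, avg


def tap(x, w_time=1.0, w_velocity=1.0):
    """Sum the three branch outputs, as stated in the Figure 3 caption."""
    return sum(tap_branches(x, w_time, w_velocity))


if __name__ == "__main__":
    # A mostly flat sequence with a single short spike at frame 4.
    transient = [0.1, 0.1, 0.1, 0.1, 5.0, 0.1, 0.1, 0.1]
    ta, va, avg = tap_branches(transient)
    print("time attention:", ta)
    print("velocity attention:", va)
    print("average:", avg)
    print("TAP (sum):", tap(transient))
```

On a sequence like `transient`, the two attention branches should land close to the spike value while the average stays near the background level, which is the dilution effect the abstract attributes to temporal average pooling. For a constant input, all three branches return the same value, so the average path contributes a stable baseline for stationary signals.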