Enhancing 1-Second 3D SELD Performance with Filter Bank Analysis and SCConv Integration in CST-Former

Zhehui Zhang

TL;DR

This paper investigates SELD with distance estimation (3D SELD) systems under short-time segments, specifically targeting a 1-second window, establishing a new baseline for practical 3D SELD applicability.

Abstract

Recent sound event localization and detection (SELD) research has predominantly focused on long-time segment scenarios (typically 5 to 10 seconds, occasionally 2 seconds), improving benchmark performance but lacking the temporal granularity needed for real-world applications. To bridge this gap, this paper investigates SELD with distance estimation (3D SELD) systems under short-time segments, specifically targeting a 1-second window, and establishes a new baseline for practical 3D SELD applicability. We further explore the impact of different filter banks (Bark, Mel, and Gammatone) for audio feature extraction; experimental results demonstrate that the Gammatone filter bank achieves the highest overall accuracy in this context. Finally, we propose replacing the convolutional modules within the CST-Former, a competitive SELD architecture, with the SCConv module. This adjustment yields measurable F-score gains in short-segment scenarios, underscoring SCConv's potential to improve spatial and channel feature representation. The experimental results highlight our approach as a significant step towards the real-world deployment of 3D SELD systems under low-latency constraints.
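To make the filter-bank comparison more concrete, the following is a minimal sketch of how band center frequencies can be placed on the Mel, Bark, and ERB-rate scales (the ERB-rate scale is the one commonly used to space Gammatone filters). It is not the paper's implementation: the conversion formulas are standard textbook approximations (the HTK-style Mel formula, Traunmüller's Bark formula, and the Glasberg and Moore ERB-rate formula), and the function names, the 64-band count, and the 50 Hz to 12 kHz range are assumptions chosen for illustration.

```python
import numpy as np

# Hz <-> perceptual-scale conversions (standard approximations, not taken from the paper).

def hz_to_mel(f):
    # HTK-style Mel formula.
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def hz_to_bark(f):
    # Traunmueller's approximation of the Bark scale.
    f = np.asarray(f, dtype=float)
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_to_hz(z):
    z = np.asarray(z, dtype=float)
    return 1960.0 * (z + 0.53) / (26.28 - z)

def hz_to_erb(f):
    # Glasberg & Moore ERB-rate scale, commonly used to space Gammatone filters.
    return 21.4 * np.log10(1.0 + 0.00437 * np.asarray(f, dtype=float))

def erb_to_hz(e):
    return (10.0 ** (np.asarray(e, dtype=float) / 21.4) - 1.0) / 0.00437

# Hypothetical registry: scale name -> (forward, inverse) conversion.
SCALES = {
    "mel": (hz_to_mel, mel_to_hz),
    "bark": (hz_to_bark, bark_to_hz),
    "gammatone": (hz_to_erb, erb_to_hz),
}

def center_frequencies(scale, n_bands=64, fmin=50.0, fmax=12000.0):
    """Return n_bands center frequencies (Hz), equally spaced on the chosen scale."""
    forward, inverse = SCALES[scale]
    points = np.linspace(forward(fmin), forward(fmax), n_bands)
    return inverse(points)

if __name__ == "__main__":
    for name in SCALES:
        cf = center_frequencies(name)
        print(name, np.round(cf[:4], 1), "...", np.round(cf[-1], 1))
```

All three scales allocate more bands to low frequencies than a linear spacing would, but they differ in how quickly the spacing widens, which is one plausible reason the choice of filter bank can affect the resulting features.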

Paper Structure

This paper contains 17 sections, 3 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Flowchart of the data preprocessing pipeline. The arrows denote non-parametric operations, while the blue boxes represent different data segments. The symbol $\oplus$ indicates the concatenation of extracted features, integrating them into the final input representation. In this figure, $T_i$ represents the time-dimension frames, and $B_i$ refers to the frequency-dimension count.
  • Figure 2: Overview of the SCConv CST-Former architecture. The left panel presents the full model, while the middle and right panels show the detailed designs of the SCConv CST and SCConv blocks. The SCConv block, replacing the original Local Perception Unit and Inverted Residual FFN in the CST block, improves local feature extraction and fusion. Here, $N$ represents the batch size, $T$ is the time-dimension frame count, $B$ refers to the frequency-dimension count, and $Class$ denotes the total number of sound event classes.
  • Figure 3: Comparison of F-scores for four models (SELD2024, Conv-Conformer, CST-Former, and SCConv CST-Former) across three filter banks (Mel, Bark, and Gammatone).
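
The concatenation and segmentation described in the Figure 1 caption can be sketched as follows. This is an illustrative sketch only, not the paper's pipeline: the function names are hypothetical, features are assumed to be arrays of shape (channels, $T$, $B$), the 50 frames per 1-second segment figure assumes a 20 ms hop, and zero-padding the final partial segment is an assumption.

```python
import numpy as np

def concat_features(*feats):
    """Concatenate (channels, T, B) feature arrays along the channel axis."""
    return np.concatenate(feats, axis=0)

def split_into_segments(x, frames_per_segment):
    """Split (C, T, B) into (n_segments, C, frames_per_segment, B).

    The last segment is zero-padded if T is not a multiple of
    frames_per_segment (an assumption made for this sketch).
    """
    c, t, b = x.shape
    n_seg = -(-t // frames_per_segment)  # ceiling division
    pad = n_seg * frames_per_segment - t
    x = np.pad(x, ((0, 0), (0, pad), (0, 0)))  # zero-pad the time axis only
    x = x.reshape(c, n_seg, frames_per_segment, b)
    return x.transpose(1, 0, 2, 3)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat_a = rng.standard_normal((4, 120, 64))  # hypothetical feature group A
    feat_b = rng.standard_normal((3, 120, 64))  # hypothetical feature group B
    segments = split_into_segments(concat_features(feat_a, feat_b), 50)
    print(segments.shape)
```

With 120 input frames and 50 frames per segment, the sketch produces three segments, the last of which contains 20 real frames followed by 30 zero-padded frames.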