Investigation of Feature Selection and Pooling Methods for Environmental Sound Classification

Parinaz Binandeh Dehaghani, Danilo Pena, A. Pedro Aguiar

TL;DR

This study tackles efficient environmental sound classification on resource-constrained devices by comparing PCA-based dimensionality reduction with sparse salient region pooling (SSRP) in lightweight CNNs. It investigates two SSRP variants, SSRP-B and SSRP-T, on ESC-50 and identifies hyperparameters that maximize performance, notably W=4 for SSRP-B and K=12 for SSRP-T, with SSRP-T achieving 80.69% accuracy. In contrast, PCA dramatically reduces dimensionality from 17,120 to 101 features but drops accuracy to 37.60%, highlighting the importance of preserving local time-frequency structure. The findings show that task-aware sparse pooling yields robust, high-accuracy ESC with reasonable computational cost, suggesting promising directions for integrating SSRP with attention mechanisms and transformer-based models.

Abstract

This paper explores the impact of dimensionality reduction and pooling methods for Environmental Sound Classification (ESC) using lightweight CNNs. We evaluate Sparse Salient Region Pooling (SSRP) and its variants, SSRP-Basic (SSRP-B) and SSRP-Top-K (SSRP-T), under various hyperparameter settings and compare them with Principal Component Analysis (PCA). Experiments on the ESC-50 dataset demonstrate that SSRP-T achieves up to 80.69% accuracy, significantly outperforming both the baseline CNN (66.75%) and the PCA-reduced model (37.60%). Our findings confirm that a well-tuned sparse pooling strategy provides a robust, efficient, and high-performing solution for ESC tasks, particularly in resource-constrained scenarios where balancing accuracy and computational cost is crucial.
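The abstract names the two pooling variants without defining them. The following is a minimal NumPy sketch of one common formulation of SSRP, assumed here rather than taken from the paper: average the feature map over sliding time windows of length `w`, then keep either the single strongest window per frequency bin (SSRP-B) or the mean of the `k` strongest windows (SSRP-T), and finally average over frequency. The function names, the `(time, frequency, channels)` layout, and the stride of 1 are illustrative assumptions.

```python
import numpy as np


def window_means(x, w):
    """Mean of every length-w window along the time axis (axis 0), stride 1.

    x is assumed to have shape (time, frequency, channels).
    Returns an array of shape (time - w + 1, frequency, channels).
    """
    x = np.asarray(x, dtype=float)
    if not 1 <= w <= x.shape[0]:
        raise ValueError("w must be between 1 and the number of time frames")
    # Prefix sums with a leading zero row make each window sum a difference.
    csum = np.concatenate([np.zeros_like(x[:1]), np.cumsum(x, axis=0)], axis=0)
    return (csum[w:] - csum[:-w]) / w


def ssrp_b(x, w):
    """Assumed SSRP-B: max over time of window means, then mean over frequency."""
    m = window_means(x, w)
    return m.max(axis=0).mean(axis=0)  # shape: (channels,)


def ssrp_t(x, w, k):
    """Assumed SSRP-T: mean of the k largest window means, then mean over frequency."""
    m = window_means(x, w)
    k = min(k, m.shape[0])             # cannot keep more windows than exist
    top_k = np.sort(m, axis=0)[-k:]    # k largest values per (frequency, channel)
    return top_k.mean(axis=0).mean(axis=0)  # shape: (channels,)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.normal(size=(50, 10, 8))  # hypothetical CNN feature map
    print(ssrp_b(features, w=4).shape)       # one value per channel
    print(ssrp_t(features, w=4, k=12).shape)
```

Under this formulation SSRP-T with `k=1` reduces to SSRP-B, which is one way to read the reported hyperparameters: `W` controls how long a salient region is, and `K` controls how many such regions contribute to each channel's descriptor.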

Paper Structure

This paper contains 18 sections, 5 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Log-Mel spectrogram representation of an audio sample used as input to the CNN. The spectrogram has a shape of (431, 40, 1), corresponding to 431 time frames and 40 mel frequency bins. The color intensity indicates signal power in decibels (dB) across time and frequency.
  • Figure 2: Common CNN Backbone Architecture
  • Figure 3: Cumulative explained variance curve for PCA applied to the ESC-50 log-Mel spectrogram features. The red dashed line indicates the 95% variance threshold, achieved with 101 components.
  • Figure 4: Comparison of validation accuracy for different window sizes in SSRP-B pooling. (a) Validation accuracy for SSRP-B with $W=4$. (b) Validation accuracy for SSRP-B with $W=6$.
  • Figure 5: Comparison of validation accuracy for different values of $K$ in SSRP-T pooling. (a) Validation accuracy for SSRP-T with $K=4$. (b) Validation accuracy for SSRP-T with $K=12$.
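
The 95% cumulative explained variance threshold described in the Figure 3 caption can be illustrated with a short NumPy sketch. It is not the authors' code: the function name and the toy data are hypothetical, and it assumes each spectrogram has been flattened to one row of a samples-by-features matrix. The idea is to centre the data, take the singular values, and count how many leading components are needed before the cumulative variance ratio reaches the threshold.

```python
import numpy as np


def n_components_for_variance(X, threshold=0.95):
    """Smallest number of principal components whose cumulative explained
    variance ratio reaches `threshold`.

    X is assumed to have shape (n_samples, n_features).
    """
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Squared singular values of the centred data are proportional to the
    # variance captured by each principal component (largest first).
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variance = singular_values ** 2
    cumulative_ratio = np.cumsum(variance) / variance.sum()
    # Index of the first ratio >= threshold, converted to a component count.
    count = int(np.searchsorted(cumulative_ratio, threshold)) + 1
    return min(count, cumulative_ratio.size)


if __name__ == "__main__":
    # Toy data: almost all variance lies along the first feature.
    toy = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 0.1], [0.0, -0.1]])
    print(n_components_for_variance(toy))
```

Applied to the flattened ESC-50 features, a procedure of this kind is what would yield a component count such as the 101 reported in the caption. Because flattening discards the time-frequency layout before projection, the sketch also makes concrete why the TL;DR attributes the accuracy drop to the loss of local structure.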