Investigation of Feature Selection and Pooling Methods for Environmental Sound Classification
Parinaz Binandeh Dehaghani, Danilo Pena, A. Pedro Aguiar
TL;DR
This study tackles efficient environmental sound classification on resource-constrained devices by comparing PCA-based dimensionality reduction with sparse salient region pooling (SSRP) in lightweight CNNs. It investigates two SSRP variants, SSRP-B and SSRP-T, on ESC-50 and identifies hyperparameters that maximize performance, notably W=4 for SSRP-B and K=12 for SSRP-T, with SSRP-T achieving 80.69% accuracy. In contrast, PCA dramatically reduces dimensionality from 17,120 to 101 features but drops accuracy to 37.60%, highlighting the importance of preserving local time-frequency structure. The findings show that task-aware sparse pooling yields robust, high-accuracy ESC with reasonable computational cost, suggesting promising directions for integrating SSRP with attention mechanisms and transformer-based models.
Abstract
This paper explores the impact of dimensionality reduction and pooling methods on Environmental Sound Classification (ESC) using lightweight CNNs. We evaluate Sparse Salient Region Pooling (SSRP) in two variants, SSRP-Basic (SSRP-B) and SSRP-Top-K (SSRP-T), under various hyperparameter settings and compare them with Principal Component Analysis (PCA). Experiments on the ESC-50 dataset demonstrate that SSRP-T achieves up to 80.69% accuracy, significantly outperforming both the baseline CNN (66.75%) and the PCA-reduced model (37.60%). Our findings confirm that a well-tuned sparse pooling strategy provides a robust, efficient, and high-performing solution for ESC tasks, particularly in resource-constrained scenarios where balancing accuracy and computational cost is crucial.
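The abstract names the two pooling variants without spelling out the operator, so the following is only a minimal NumPy sketch under stated assumptions, not the authors' implementation. It assumes that W is the length of a sliding window along the time axis of a CNN feature map, that SSRP-B keeps the single most salient window per frequency bin, and that SSRP-T averages the K most salient windows. The function names `window_means`, `ssrp_b`, and `ssrp_t` and the (channels, freq, time) layout are hypothetical choices made for illustration.

```python
import numpy as np


def window_means(feature_map, W):
    """Mean of every length-W window along the time axis (stride 1).

    feature_map: array of shape (channels, freq, time).
    Returns an array of shape (channels, freq, time - W + 1).
    """
    C, F, T = feature_map.shape
    if not 1 <= W <= T:
        raise ValueError("W must satisfy 1 <= W <= time length")
    # A prefix sum with a leading zero lets each window sum be one subtraction.
    csum = np.cumsum(feature_map, axis=-1)
    csum = np.concatenate([np.zeros((C, F, 1)), csum], axis=-1)
    return (csum[..., W:] - csum[..., :-W]) / W


def ssrp_b(feature_map, W=4):
    """Assumed SSRP-B: best window mean per frequency bin, averaged over frequency."""
    means = window_means(feature_map, W)
    return means.max(axis=-1).mean(axis=-1)  # shape: (channels,)


def ssrp_t(feature_map, W=4, K=12):
    """Assumed SSRP-T: mean of the K largest window means, averaged over frequency."""
    means = window_means(feature_map, W)
    K = min(K, means.shape[-1])  # guard against short feature maps
    top_k = np.sort(means, axis=-1)[..., -K:]  # K largest values per (channel, freq)
    return top_k.mean(axis=-1).mean(axis=-1)  # shape: (channels,)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fmap = rng.random((8, 10, 40))  # hypothetical CNN output: 8 channels, 10 bins, 40 frames
    print(ssrp_b(fmap, W=4).shape, ssrp_t(fmap, W=4, K=12).shape)
```

Under these assumptions, SSRP-T with K=1 reduces to SSRP-B, and a K equal to the number of windows approaches plain average pooling, which is one way to see why K behaves as a sparsity knob that has to be tuned.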
