Performance Comparison of CNN and AST Models with Stacked Features for Environmental Sound Classification
Parinaz Binandeh Dehaghani, Danilo Pena, A. Pedro Aguiar
TL;DR
This paper addresses environmental sound classification (ESC) under data and compute constraints. It evaluates CNNs trained on stacked feature representations (Log-Mel, MFCC, GTCC, Chroma, Spectral Contrast, Tonnetz) and compares them with an Audio Spectrogram Transformer (AST) on ESC-50 and UrbanSound8K under varying pretraining regimes. Feature-stacked CNNs prove data- and compute-efficient: CNN-1 reaches up to 92.46% validation accuracy on ESC-50 when pretrained on ESC-50 and fine-tuned on UrbanSound8K. AST benefits substantially from large-scale pretraining (e.g., AudioSet) but remains less competitive when only moderate amounts of training data are available. The results point to a practical trade-off: lightweight feature-stacked CNNs suit edge and real-time deployments, whereas transformer-based models excel when abundant pretraining resources are available. These findings can guide deployment decisions for ESC systems in constrained environments and motivate future work on online and adaptive learning for dynamic acoustic conditions.
Abstract
Environmental sound classification (ESC) has gained significant attention due to its diverse applications in smart city monitoring, fault detection, acoustic surveillance, and manufacturing quality control. To enhance CNN performance, feature stacking techniques have been explored to aggregate complementary acoustic descriptors into richer input representations. In this paper, we investigate CNN-based models employing various stacked feature combinations, including Log-Mel Spectrogram (LM), Spectral Contrast (SPC), Chroma (CH), Tonnetz (TZ), Mel-Frequency Cepstral Coefficients (MFCCs), and Gammatone Cepstral Coefficients (GTCC). Experiments are conducted on the widely used ESC-50 and UrbanSound8K datasets under different training regimes, including pretraining on ESC-50, fine-tuning on UrbanSound8K, and comparison with Audio Spectrogram Transformer (AST) models pretrained on large-scale corpora such as AudioSet. This experimental design enables an analysis of how feature-stacked CNNs compare with transformer-based models under varying levels of training data and pretraining diversity. The results indicate that feature-stacked CNNs offer a more computationally and data-efficient alternative when large-scale pretraining or extensive training data are unavailable, making them particularly well suited for resource-constrained and edge-level sound classification scenarios.
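The feature stacking described above can be illustrated with a minimal sketch. It assumes that each descriptor is a matrix of shape (bins, frames) computed over the same frames, that each descriptor is standardized independently, and that stacking means concatenation along the bin axis; the abstract does not specify these details, so they are assumptions rather than the authors' method. The function names, bin counts, and frame count are hypothetical, and random arrays stand in for real features.

```python
import numpy as np


def standardize(feature):
    """Scale one feature matrix (bins x frames) to zero mean, unit variance."""
    return (feature - feature.mean()) / (feature.std() + 1e-8)


def stack_features(features):
    """Concatenate descriptors along the bin axis into one CNN input.

    `features` maps a descriptor name to an array of shape (n_bins, n_frames).
    All descriptors must share the same number of frames.
    Returns an array of shape (1, total_bins, n_frames); the leading axis is
    a single image-like channel (an assumption of this sketch).
    """
    frame_counts = {f.shape[1] for f in features.values()}
    if len(frame_counts) != 1:
        raise ValueError(f"mismatched frame counts: {sorted(frame_counts)}")
    stacked = np.concatenate([standardize(f) for f in features.values()], axis=0)
    return stacked[np.newaxis, ...]


# Synthetic placeholders for real descriptors; bin counts are illustrative.
rng = np.random.default_rng(0)
n_frames = 216
features = {
    "log_mel": rng.normal(size=(128, n_frames)),
    "mfcc": rng.normal(size=(20, n_frames)),
    "chroma": rng.random(size=(12, n_frames)),
    "spectral_contrast": rng.normal(size=(7, n_frames)),
    "tonnetz": rng.normal(size=(6, n_frames)),
}

cnn_input = stack_features(features)
print(cnn_input.shape)  # (1, 173, 216)
```

Standardizing each descriptor before concatenation keeps descriptors with large numeric ranges from dominating the stacked input. Placing each descriptor in its own channel is another plausible design, but it would require resizing all descriptors to a common number of bins.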
