Studying the Effect of Audio Filters in Pre-Trained Models for Environmental Sound Classification

Aditya Dawn, Wazib Ansar

TL;DR

This work targets Environmental Sound Classification by introducing a Two-Level Classification framework that first assigns a broad sound category and then refines it to a specific class. It uses pre-trained CNN backbones (VGG, ResNet, EfficientNet) and a set of audio modifiers (PCEN, Spectral Gating, Audio Crop, and audio filters) to generate diverse spectrogram inputs for classification on the ESC-50 dataset. The approach achieves Level 1 accuracy up to 78.75% and Level 2 accuracy up to 98.04%, with Audio Crop and EfficientNet variants frequently delivering the best results. Overall, conditioning fine-grained classifiers on Level 1 outputs and systematically evaluating modifiers provides practical guidance for improving environmental sound recognition systems.

Abstract

Environmental Sound Classification is an important problem in sound recognition and is more complicated than speech recognition because environmental sounds are not well structured with respect to time and frequency. Over the past years, researchers have used various CNN models to learn from audio features such as log mel spectrograms, gammatone spectral coefficients, and mel-frequency spectral coefficients generated from the audio files. In this paper, we propose a new methodology, Two-Level Classification: the Level 1 Classifier assigns the audio signal to a broader class, and the Level 2 Classifiers then determine the actual class to which the audio belongs, based on the output of the Level 1 Classifier. We also show the effects of different audio filters, including Audio Crop, a new method introduced in this paper, which gave the highest accuracies in most cases. We used the ESC-50 dataset for our experiments and obtained a maximum accuracy of 78.75% for Level 1 Classification and 98.04% for Level 2 Classification.
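
The routing idea behind Two-Level Classification can be illustrated with a minimal sketch. This is not the authors' implementation: the function `two_level_predict`, the toy threshold classifiers, and the category keys `animals` and `urban` are all hypothetical stand-ins for trained CNN models, chosen only to show how the Level 1 output selects which Level 2 classifier is applied.

```python
from typing import Callable, Dict, Sequence, Tuple

Features = Sequence[float]
Classifier = Callable[[Features], str]


def two_level_predict(
    features: Features,
    level1: Classifier,
    level2_by_category: Dict[str, Classifier],
) -> Tuple[str, str]:
    """Run Level 1, then the Level 2 classifier registered for that category."""
    category = level1(features)  # broad class
    fine_classifier = level2_by_category[category]  # chosen by Level 1 output
    return category, fine_classifier(features)  # specific class


# Hypothetical toy classifiers: threshold rules in place of trained CNNs.
def toy_level1(features: Features) -> str:
    return "animals" if features[0] < 0.5 else "urban"


def toy_animals(features: Features) -> str:
    return "dog" if features[1] < 0.5 else "rooster"


def toy_urban(features: Features) -> str:
    return "siren" if features[1] < 0.5 else "car_horn"


# One Level 2 classifier per broad category (assumed layout).
LEVEL2 = {"animals": toy_animals, "urban": toy_urban}

if __name__ == "__main__":
    print(two_level_predict([0.2, 0.9], toy_level1, LEVEL2))
```

One consequence of this design is that a Level 1 mistake sends the sample to the wrong Level 2 classifier, which cannot then output the true class, so the two reported accuracies are best read as per-level figures rather than a single end-to-end score.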

Paper Structure

This paper contains 35 sections, 7 figures, 14 tables, 3 algorithms.

Figures (7)

  • Figure 1: CNN architecture
  • Figure 2: Illustration of the proposed workflow
  • Figure 3: Effect of Spectral Gating
  • Figure 4: Spectrograms of some Audio files
  • Figure 5: Spectrogram representation of PCEN done on two audio files
  • ...and 2 more figures