Toward end-to-end interpretable convolutional neural networks for waveform signals

Linh Vu; Thu Tran; Wern-Han Lim; Raphael Phan

TL;DR

The paper presents IConNet, an end-to-end interpretable CNN for raw waveform inputs whose front-end is built from FIR filters with learnable window functions, aiming at both efficiency and transparency. On three speech emotion recognition datasets, IConNet outperforms Mel spectrogram features by up to seven percent and is put forward as a potential lightweight replacement for MFCC. On the PhysioNet Heart Sound Database, the learned front-end filters can be inspected directly, showing which frequency bands the model emphasizes or suppresses, which matters in healthcare applications where model transparency is important. Overall, the work argues that end-to-end waveform processing with an interpretable front-end is a viable alternative to spectrogram-based features for audio ML tasks.

Abstract

This paper introduces a novel convolutional neural network (CNN) framework tailored for end-to-end audio deep learning models, presenting advancements in efficiency and explainability. In benchmark experiments on three standard speech emotion recognition datasets with five-fold cross-validation, our framework outperforms Mel spectrogram features by up to seven percent. It can potentially replace Mel-Frequency Cepstral Coefficients (MFCC) while remaining lightweight. Furthermore, we demonstrate the efficiency and interpretability of the front-end layer using the PhysioNet Heart Sound Database, illustrating its ability to handle and capture intricate long waveform patterns. Our contributions offer a portable solution for building efficient and interpretable models for raw waveform data.

Paper Structure (14 sections, 4 equations, 4 figures, 3 tables)

Introduction
The IConNet architecture
Speech Emotion Recognition
Background
Experiment setup
Datasets
Classifiers and evaluation
Experiment results
Abnormal heart sound detection
Background
Proposed model
Experiment setup
Experiment results
Conclusions

Figures (4)

Figure 1: The proposed IConNet architecture for end-to-end audio classification: (A) the front-end block containing the FIRconv layer; (B) the proposed general architecture for end-to-end audio classification; (C) the classifier used in the experiments.
Figure 2: Results on the RAVDESS and CREMA-D datasets after 60 epochs
Figure 3: Comparison of window shape and frequency response of filters from different bands. The chart displays the frequency response of low-range (a), mid-range (b), and high-range (c) frequency bands. The red line at -20 dB represents the threshold at which noise is perceived as not noticeable.
Figure 4: Frequency response of filters from different bands
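
To make the frequency-response plots in Figures 3 and 4 more concrete, the following is a minimal sketch of how a single windowed FIR band-pass filter could be built and checked against the -20 dB line. It is not the paper's FIRconv implementation: the generalized cosine window, the function names (cosine_window, bandpass_fir, gain_db), the band edges, the sample rate, and the filter length are all assumptions chosen for illustration. In a learnable front-end, the window coefficients a0 and a1 would presumably be trainable parameters instead of fixed constants.

```python
import cmath
import math


def cosine_window(length, a0, a1):
    """Generalized cosine window w[n] = a0 - a1 * cos(2*pi*n / (length - 1)).

    a0=0.54, a1=0.46 gives the Hamming window. Treating a0 and a1 as the
    learnable part of the window is an assumption of this sketch.
    """
    return [a0 - a1 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]


def sinc(x):
    """Normalized sinc: sin(pi*x) / (pi*x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def bandpass_fir(low_hz, high_hz, sample_rate, length, a0=0.54, a1=0.46):
    """Windowed-sinc band-pass taps: difference of two ideal low-pass filters."""
    f1 = low_hz / sample_rate   # lower edge, in cycles per sample
    f2 = high_hz / sample_rate  # upper edge, in cycles per sample
    mid = (length - 1) / 2      # center tap, keeps the filter symmetric
    window = cosine_window(length, a0, a1)
    return [
        (2 * f2 * sinc(2 * f2 * (n - mid)) - 2 * f1 * sinc(2 * f1 * (n - mid))) * w
        for n, w in zip(range(length), window)
    ]


def gain_db(taps, freq_hz, sample_rate):
    """Magnitude response in dB at one frequency, by direct DTFT evaluation."""
    omega = 2 * math.pi * freq_hz / sample_rate
    h = sum(t * cmath.exp(-1j * omega * n) for n, t in enumerate(taps))
    return 20 * math.log10(max(abs(h), 1e-12))  # floor avoids log10(0)


# Hypothetical low-range band: 300-600 Hz at a 16 kHz sample rate, 511 taps.
taps = bandpass_fir(300.0, 600.0, 16000.0, 511)
for freq in (50.0, 450.0, 4000.0):
    level = gain_db(taps, freq, 16000.0)
    print(f"{freq:7.1f} Hz: {level:7.1f} dB, above -20 dB line: {level > -20.0}")
```

Reading a filter this way (which frequencies sit above the -20 dB line and which fall below it) is the kind of inspection the captions describe. Changing a0 and a1 alters the window shape and therefore the side-lobe level of the response.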
