ENACT-Heart -- ENsemble-based Assessment Using CNN and Transformer on Heart Sounds

Jiho Han, Adnan Shaout

TL;DR

ENACT-Heart presents a multimodal, MoE-based ensemble for heart-sound classification by combining ViT on spectrograms with CNN on centroid visualizations. Through Gaussian data augmentation and audiovisual data generation, the approach achieves 97.52% accuracy, outperforming the individual ViT and CNN baselines. The study demonstrates that leveraging complementary representations via a gating-based ensemble can enhance diagnostic accuracy in cardiovascular monitoring, with strong implications for wearable-based health surveillance. Overall, ENACT-Heart offers a robust, practical pathway toward high-accuracy, noninvasive cardiac diagnostics in real-world settings.

Abstract

This study explores the application of Vision Transformer (ViT) principles to audio analysis, focusing specifically on heart sounds. The paper introduces ENACT-Heart, a novel ensemble approach that combines the complementary strengths of Convolutional Neural Networks (CNN) and ViT through a Mixture of Experts (MoE) framework, achieving a classification accuracy of 97.52%. This exceeds the accuracy of the standalone ViT (93.88%) and CNN (95.45%) models, demonstrating the potential of ensemble methods to improve classification performance in cardiovascular health monitoring and diagnosis.
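The gating idea behind an MoE ensemble can be illustrated with a minimal sketch. The abstract does not specify how the gate is built or trained, so everything here is an assumption made for illustration: the function names `softmax` and `moe_combine`, the use of per-sample gate logits, and the two-class toy inputs. The sketch only shows how softmax gate weights blend the class probabilities produced by two experts.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def moe_combine(vit_probs, cnn_probs, gate_logits):
    """Blend two experts' class probabilities with per-sample gate weights.

    vit_probs, cnn_probs: arrays of shape (batch, classes), rows sum to 1.
    gate_logits: array of shape (batch, 2); column 0 scores the ViT expert,
                 column 1 scores the CNN expert (hypothetical layout).
    """
    weights = softmax(gate_logits, axis=1)               # (batch, 2)
    experts = np.stack([vit_probs, cnn_probs], axis=1)   # (batch, 2, classes)
    # Weighted sum over the expert axis gives the ensemble probabilities.
    return (weights[:, :, None] * experts).sum(axis=1)   # (batch, classes)


# Toy usage: two samples, two classes (values are made up, not from the paper).
vit_probs = np.array([[0.9, 0.1], [0.4, 0.6]])
cnn_probs = np.array([[0.7, 0.3], [0.2, 0.8]])
gate_logits = np.array([[0.0, 0.0], [-20.0, 20.0]])
combined = moe_combine(vit_probs, cnn_probs, gate_logits)
predicted_class = combined.argmax(axis=1)
```

In a trained system the gate logits would come from a learned network rather than being set by hand; with equal logits the blend reduces to a plain average of the two experts.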

Paper Structure

This paper contains 22 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The ENACT-Heart pipeline consists of three core steps: data augmentation, expert analysis of each modality, and analysis fusion. 1) Data augmentation with Gaussian noise increases data variability and improves the generalization of the overall model. 2) Spectrogram analysis is performed by the ViT, and audiovisual diagram analysis is performed by the CNN, allowing each model to leverage its strengths in feature extraction for a different modality.
  • Figure 2: Overview of the ENACT-Heart Workflow. The flowchart outlines the data preparation, model training, and ensemble process used to combine ViT and CNN predictions.
  • Figure 3: Confusion matrix for ENACT-Heart.
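
The Gaussian noise augmentation step named in Figure 1 can be sketched as follows. The summary does not state the noise level or how it is parameterized, so the target signal-to-noise ratio (`snr_db`), the function name `add_gaussian_noise`, and the synthetic sine wave standing in for a heart-sound recording are all assumptions for illustration only.

```python
import numpy as np


def add_gaussian_noise(signal, snr_db, rng):
    """Return a copy of `signal` with zero-mean Gaussian noise added.

    The noise variance is chosen so that the expected signal-to-noise
    ratio equals `snr_db` decibels (an assumed parameterization).
    """
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


# Toy usage: a 1-second, 50 Hz sine wave sampled at 16 kHz as a stand-in signal.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2.0 * np.pi * 50.0 * t)
noisy = add_gaussian_noise(clean, snr_db=20.0, rng=rng)

# Measure the SNR that was actually achieved on this draw.
measured_snr_db = 10.0 * np.log10(
    np.mean(clean ** 2) / np.mean((noisy - clean) ** 2)
)
```

Passing an explicit random generator keeps the augmentation reproducible, which helps when comparing models trained on the same augmented set.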