ENACT-Heart -- ENsemble-based Assessment Using CNN and Transformer on Heart Sounds
Jiho Han, Adnan Shaout
TL;DR
ENACT-Heart is a multimodal Mixture of Experts (MoE) ensemble for heart-sound classification that combines a Vision Transformer (ViT) on spectrograms with a CNN on centroid visualizations. Using Gaussian data augmentation and audiovisual data generation, the approach achieves 97.52% accuracy, outperforming the individual ViT and CNN baselines. The study indicates that combining complementary representations through a gating-based ensemble can improve diagnostic accuracy in cardiovascular monitoring, with implications for wearable-based health surveillance. ENACT-Heart thus offers a practical route toward high-accuracy, noninvasive cardiac diagnostics in real-world settings.
Abstract
This study explores the application of Vision Transformer (ViT) principles to audio analysis, focusing on heart sounds. It introduces ENACT-Heart, a novel ensemble approach that leverages the complementary strengths of Convolutional Neural Networks (CNN) and ViT through a Mixture of Experts (MoE) framework, achieving a classification accuracy of 97.52%. This outperforms the individual ViT (93.88%) and CNN (95.45%) models, demonstrating the potential of ensemble methods to improve classification performance in cardiovascular health monitoring and diagnosis.
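The gating idea behind an MoE ensemble can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the two-class setup, the probability values, and the gate logits below are all hypothetical, and a softmax gate is assumed as one common way of weighting experts per sample.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def moe_combine(expert_probs, gate_logits):
    """Blend expert predictions with per-sample gate weights.

    expert_probs: array of shape (n_experts, batch, n_classes)
    gate_logits:  array of shape (batch, n_experts)
    Returns an array of shape (batch, n_classes).
    """
    weights = softmax(gate_logits, axis=-1)  # (batch, n_experts), rows sum to 1
    # Weighted sum over the expert axis for every sample and class.
    return np.einsum("be,ebc->bc", weights, expert_probs)

# Hypothetical class probabilities for two recordings (values are made up).
vit_probs = np.array([[0.70, 0.30], [0.40, 0.60]])  # stand-in for the ViT expert
cnn_probs = np.array([[0.90, 0.10], [0.20, 0.80]])  # stand-in for the CNN expert

# Hypothetical gate logits: equal trust for the first recording,
# three times more weight on the CNN for the second.
gate_logits = np.array([[0.0, 0.0], [0.0, np.log(3.0)]])

combined = moe_combine(np.stack([vit_probs, cnn_probs]), gate_logits)
predictions = combined.argmax(axis=1)
```

In a trained system the gate logits would come from a learned gating network rather than being fixed by hand, so that each recording is routed toward whichever expert is more reliable for it.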
