BUET Multi-disease Heart Sound Dataset: A Comprehensive Auscultation Dataset for Developing Computer-Aided Diagnostic Systems
Shams Nafisa Ali, Afia Zahin, Samiul Based Shuvo, Nusrat Binta Nizam, Shoyad Ibn Sabur Khan Nuhash, Sayeed Sajjad Razin, S. M. Sakeef Sani, Farihin Rahman, Nawshad Binta Nizam, Farhat Binte Azam, Rakib Hossen, Sumaiya Ohab, Nawsabah Noor, Taufiq Hasan
TL;DR
The paper introduces the BUET Multi-disease Heart Sound (BMD-HS) dataset, a rigorously curated collection of 864 phonocardiogram (PCG) recordings in six classes (Normal, AS, AR, MR, MS, MD) with multi-label annotations, echocardiogram-confirmed diagnoses, and rich metadata to support AI-driven cardiovascular diagnostics. It details a standardized collection protocol covering four auscultation sites and 108 subjects, with eight 20-second recordings per subject (108 × 8 = 864), aiming to reduce device and site bias while enabling region-specific CVD research in Bangladesh. A benchmarking study compares CNN-based models, with and without metadata fusion, against recurrent architectures (LSTM/GRU); the primary CNN+metadata model performs best (accuracy ~0.80), which suggests that temporal sequence modeling may be less beneficial for this task. The dataset addresses limitations of existing public PCG resources by providing multi-label disease states, comprehensive demographic context, and echocardiogram validation, enabling more nuanced learning and broader applicability in resource-constrained settings and global health research.
Abstract
Cardiac auscultation, an integral tool in diagnosing cardiovascular diseases (CVDs), often relies on the subjective interpretation of clinicians, presenting a limitation in consistency and accuracy. Addressing this, we introduce the BUET Multi-disease Heart Sound (BMD-HS) dataset - a comprehensive and meticulously curated collection of heart sound recordings. This dataset, encompassing 864 recordings across five distinct classes of common heart sounds, represents a broad spectrum of valvular heart diseases, with a focus on diagnostically challenging cases. The standout feature of the BMD-HS dataset is its innovative multi-label annotation system, which captures a diverse range of diseases and unique disease states. This system significantly enhances the dataset's utility for developing advanced machine learning models in automated heart sound classification and diagnosis. By bridging the gap between traditional auscultation practices and contemporary data-driven diagnostic methods, the BMD-HS dataset is poised to revolutionize CVD diagnosis and management, providing an invaluable resource for the advancement of cardiac health research. The dataset is publicly available at this link: https://github.com/mHealthBuet/BMD-HS-Dataset.
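To make the multi-label annotation idea concrete, the following is a minimal sketch of how one subject's diagnosis could be encoded as a multi-hot vector over the four valvular conditions (AS, AR, MR, MS). The column order, the function names `encode_labels` and `summary_class`, and the rule that maps an all-zero vector to Normal and two or more positives to MD are illustrative assumptions, not the dataset's documented schema. The final lines check the recording count implied by 108 subjects with eight recordings each.

```python
# Hypothetical label order; the dataset's actual column layout may differ.
DISEASES = ("AS", "AR", "MR", "MS")


def encode_labels(present):
    """Return a multi-hot list marking which conditions are present."""
    unknown = set(present) - set(DISEASES)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if d in present else 0 for d in DISEASES]


def summary_class(vector):
    """Collapse a multi-hot vector into one of six class names.

    Assumption: no positives means Normal, exactly one positive means
    that disease, and two or more positives mean multi-disease (MD).
    """
    positives = sum(vector)
    if positives == 0:
        return "Normal"
    if positives == 1:
        return DISEASES[vector.index(1)]
    return "MD"


# Recording count implied by the collection protocol.
SUBJECTS = 108
RECORDINGS_PER_SUBJECT = 8
TOTAL_RECORDINGS = SUBJECTS * RECORDINGS_PER_SUBJECT

if __name__ == "__main__":
    vec = encode_labels({"AS", "MR"})
    print(vec, summary_class(vec))
    print(TOTAL_RECORDINGS)
```

Keeping the per-disease vector alongside the collapsed class name lets a model train on the individual conditions while still reporting results over the six classes.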
