A Generalist Audio Foundation Model for Comprehensive Body Sound Auscultation

Pingjie Wang, Liudan Zhao, Zihan Zhao, Miao He, Xin Sun, Ya Zhang, Kun Sun, Yanfeng Wang, Yu Wang

TL;DR

AuscultaBase introduces a generalist foundation model for body-sound auscultation by unifying heart, lung, and bowel sounds through large-scale, self-supervised pretraining on AuscultaCorpus. It is evaluated on AuscultaBench, a 16-task benchmark spanning abnormality detection and disease diagnosis, where AuscultaBase consistently outperforms state-of-the-art baselines and demonstrates robustness across sound types and data imbalances. A clinical comparison with pediatric cardiology experts reveals higher sensitivity and strong accuracy, especially in younger patients, supporting its potential as a diagnostic assistant. The work provides a scalable framework and benchmark for AI-enabled auscultation, with open-source code and model checkpoints to foster further research and clinical translation.

Abstract

Accurate and efficient auscultation-based diagnostics are vital for early disease detection, especially in resource-limited settings where specialized clinical expertise is scarce. Traditional auscultation, which depends heavily on clinician experience, suffers from significant inter-observer variability, while existing AI models often falter because their training data are not representative. In this study, we introduce AuscultaBase, a novel AI-driven diagnostic framework that harnesses self-supervised and contrastive learning techniques alongside large-scale, multi-source data integration to advance body sound analysis. By generating robust feature representations, AuscultaBase markedly enhances performance in abnormality detection, disease classification, and activity recognition tasks. Comprehensive evaluations on our newly established benchmark, AuscultaBench, demonstrate that AuscultaBase consistently outperforms state-of-the-art methods across key performance metrics, underscoring its potential as a scalable and cost-effective tool for clinical screening and early disease intervention. The code and model checkpoint have been released at https://github.com/applewpj/AuscultaBase.

Paper Structure

This paper contains 84 sections, 3 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: System overview of our framework. During the Data Construction stage, the AuscultaCorpus is first collected with stethoscopes and then pre-processed into spectrograms. The audio encoder is then pre-trained on the AuscultaCorpus with contrastive learning to obtain the base model for auscultation, AuscultaBase. Finally, we apply linear or full fine-tuning to adapt AuscultaBase to the 16 downstream auscultation tasks in our evaluation benchmark, AuscultaBench. Additionally, we conduct a clinical evaluation that compares our model with experienced clinicians to analyze auscultation performance and to explore its prospects in various clinical application scenarios.
  • Figure 2: Statistics of AuscultaBench. The Sankey diagram shows how the different body sound types (left), datasets (left middle), tasks (right middle), and abnormality or disease categories (right) contribute to the final evaluation benchmark. At the bottom left, two bar charts show the data distributions for cardiac and respiratory entities, respectively.
  • Figure 3: Receiver operating characteristic (ROC) curves and average AUC for the binary classification tasks (T4, T9, T14, T15, and T16) over 3 independent runs (shown in 3 colors).
  • Figure 4: Borda count scores (BCS) categorized by the task function (abnormality detection and disease diagnosis), sound type (lung sound, heart sound, and bowel sound), and task type (binary classification, multiclass classification, and multilabel classification). BCS ranges from 1 to 5, and higher is better.
  • Figure 5: Statistics of the test samples and comparison of diagnostic performance between human clinicians and AuscultaBase. (a) Overview of the clinical evaluation process. (b) Distribution of test samples across gender and age. (c) Confusion matrices of the diagnostic results from the human clinicians and from AuscultaBase. (d) Sensitivity, specificity, and diagnostic accuracy across genders and age groups. Sensitivity, specificity, and accuracy range from 0 to 1, and higher is better.
  • ...and 8 more figures
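
The sensitivity, specificity, and accuracy reported in Figure 5 are standard quantities derived from a binary confusion matrix. The following is a minimal sketch of how they are computed, not the authors' evaluation code: the names `BinaryConfusion` and `confusion_from_labels` are hypothetical, and the labels are made-up toy data (1 = abnormal, 0 = normal), not results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinaryConfusion:
    """Counts of a 2x2 confusion matrix (hypothetical helper type)."""
    tp: int  # abnormal cases correctly flagged
    fp: int  # normal cases wrongly flagged
    tn: int  # normal cases correctly cleared
    fn: int  # abnormal cases missed


def confusion_from_labels(y_true: Sequence[int], y_pred: Sequence[int]) -> BinaryConfusion:
    """Tally the confusion matrix from paired 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return BinaryConfusion(tp, fp, tn, fn)


def _ratio(num: int, den: int) -> float:
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")


def sensitivity(c: BinaryConfusion) -> float:
    """Fraction of truly abnormal cases that were detected: TP / (TP + FN)."""
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c: BinaryConfusion) -> float:
    """Fraction of truly normal cases that were cleared: TN / (TN + FP)."""
    return _ratio(c.tn, c.tn + c.fp)


def accuracy(c: BinaryConfusion) -> float:
    """Fraction of all cases classified correctly."""
    return _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)


# Toy labels for illustration only (not data from the paper).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
cm = confusion_from_labels(y_true, y_pred)
print(cm, sensitivity(cm), specificity(cm), accuracy(cm))
```

Computing the same three values separately for each gender or age subgroup, by filtering the label pairs before tallying, yields a per-group breakdown of the kind shown in panel (d).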