BTS: Bridging Text and Sound Modalities for Metadata-Aided Respiratory Sound Classification

June-Woo Kim, Miika Toikkanen, Yera Choi, Seoung-Eun Moon, Ho-Young Jung

TL;DR

The paper addresses variability in respiratory sound classification caused by patient demographics and recording environments by introducing BTS, a text-audio multimodal model that uses free-text metadata descriptions generated from recording attributes. By fine-tuning a CLAP-based framework to jointly encode metadata-derived text and audio, BTS forms multimodal representations that improve classification of respiratory sounds on the ICBHI dataset, achieving a new state of the art with a 1.17% gain in Score over the previous best result. The authors also show that BTS remains robust when metadata is unknown or missing, and they analyze which metadata types contribute most to performance, with recording location and device standing out. This work highlights the practical value of metadata-aware multimodal learning for clinical respiratory diagnostics and provides a reusable approach for leveraging descriptive metadata in similar tasks.

Abstract

Respiratory sound classification (RSC) is challenging due to varied acoustic signatures, primarily influenced by patient demographics and recording environments. To address this issue, we introduce a text-audio multimodal model that utilizes metadata of respiratory sounds, which provides useful complementary information for RSC. Specifically, we fine-tune a pretrained text-audio multimodal model using free-text descriptions derived from the sound samples' metadata, which includes the gender and age of patients, the type of recording device, and the recording location on the patient's body. Our method achieves state-of-the-art performance on the ICBHI dataset, surpassing the previous best result by a notable margin of 1.17%. This result validates the effectiveness of leveraging metadata and respiratory sound samples in enhancing RSC performance. Additionally, we investigate the model performance in the case where metadata is partially unavailable, which may occur in real-world clinical settings.
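
As a rough illustration of how metadata fields could be turned into a free-text description, the sketch below builds one sentence from age, gender, recording location, and device, and simply omits any field that is unknown. The function name `describe_metadata`, its parameters, and the sentence template are assumptions made for this example; the summary above does not give the exact wording the authors use.

```python
from typing import Optional


def describe_metadata(
    age: Optional[float] = None,
    sex: Optional[str] = None,
    location: Optional[str] = None,
    device: Optional[str] = None,
) -> str:
    """Build a free-text description from whichever metadata fields are known.

    Hypothetical template: the wording is illustrative, not the paper's.
    """
    # Subject phrase: mention gender only when it is available.
    subject = f"a {sex} patient" if sex is not None else "a patient"
    # Append age only when it is available.
    if age is not None:
        subject += f" aged {age:g}"

    sentence = f"This respiratory sound was recorded from {subject}"
    # Missing location or device is skipped rather than filled with a guess.
    if location is not None:
        sentence += f" at the {location}"
    if device is not None:
        sentence += f" using the {device} device"
    return sentence + "."


# Full metadata versus partially missing metadata.
print(describe_metadata(age=70, sex="male", location="left anterior chest", device="Meditron"))
print(describe_metadata(sex="female", device="Meditron"))
```

Dropping unknown fields from the sentence is only one way to handle partially unavailable metadata; substituting a placeholder word such as "unknown" would be another.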

Paper Structure

This paper contains 17 sections, 2 equations, 1 figure, 6 tables.

Figures (1)

  • Figure 1: An overall illustration of the proposed BTS architecture. The pretrained text and audio encoders extract feature representations of the text description derived from metadata and of the respiratory sound sample, respectively. After projection, the two representations are concatenated and used for RSC.
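
The fusion step described in the Figure 1 caption can be sketched in plain Python as follows: each modality's feature vector is projected, the two projections are concatenated, and a linear head maps the result to class logits. This is a minimal sketch under stated assumptions, not the authors' implementation: the feature dimensions, the random weights, the helper names, and the choice of four output classes are all placeholders, and the pretrained encoders are replaced by random vectors.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def fuse_and_classify(
    text_feat: Vector,
    audio_feat: Vector,
    w_text: Matrix,
    w_audio: Matrix,
    w_head: Matrix,
) -> Vector:
    """Project both modalities, concatenate them, and apply a linear head."""
    text_proj = linear(w_text, text_feat)
    audio_proj = linear(w_audio, audio_feat)
    fused = text_proj + audio_proj  # list concatenation of the two projections
    return linear(w_head, fused)


# Assumed sizes for illustration only.
TEXT_DIM, AUDIO_DIM, PROJ_DIM, NUM_CLASSES = 8, 6, 4, 4

rng = random.Random(0)
# Random vectors stand in for the outputs of the pretrained encoders.
text_feat = [rng.random() for _ in range(TEXT_DIM)]
audio_feat = [rng.random() for _ in range(AUDIO_DIM)]

w_text = random_matrix(PROJ_DIM, TEXT_DIM, rng)
w_audio = random_matrix(PROJ_DIM, AUDIO_DIM, rng)
# The head sees both projections side by side, hence 2 * PROJ_DIM inputs.
w_head = random_matrix(NUM_CLASSES, 2 * PROJ_DIM, rng)

logits = fuse_and_classify(text_feat, audio_feat, w_text, w_audio, w_head)
predicted = max(range(NUM_CLASSES), key=lambda i: logits[i])
print(len(logits), predicted)
```

Concatenation keeps the two projected representations separate in the fused vector, so the classification head can weight text-derived and audio-derived features independently.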