RepAugment: Input-Agnostic Representation-Level Augmentation for Respiratory Sound Classification

June-Woo Kim, Miika Toikkanen, Sangmin Bae, Minseok Kim, Ho-Young Jung

TL;DR

The paper addresses data scarcity and modality mismatch in respiratory sound classification by evaluating pretrained speech models and proposing RepAugment, an input-agnostic representation-level augmentation. RepAugment combines Rep-Mask and Rep-Gen to perturb model representations before the classifier, so it applies regardless of input type or backbone. In experiments on the ICBHI dataset, RepAugment outperforms SpecAugment in several settings and yields substantial gains for minority classes, up to 7.14 percentage points, with results comparable to the state of the art for some backbones. The work also uses t-SNE to show a domain gap between speech and respiratory sounds, underscoring the potential of representation-level augmentation to improve clinical diagnostics for abnormal lung sounds.

Abstract

Recent advancements in AI have democratized its deployment as a healthcare assistant. While pretrained models from large-scale visual and audio datasets have demonstrably generalized to this task, surprisingly, no studies have explored pretrained speech models, which, as human-originated sounds, intuitively would share closer resemblance to lung sounds. This paper explores the efficacy of pretrained speech models for respiratory sound classification. We find that there is a characterization gap between speech and lung sound samples, and to bridge this gap, data augmentation is essential. However, the most widely used augmentation technique for audio and speech, SpecAugment, requires a 2-dimensional spectrogram format and cannot be applied to models pretrained on speech waveforms. To address this, we propose RepAugment, an input-agnostic representation-level augmentation technique that not only outperforms SpecAugment but is also suitable for respiratory sound classification with waveform pretrained models. Experimental results show that our approach outperforms SpecAugment, demonstrating a substantial improvement in the accuracy of minority disease classes, reaching up to 7.14%.

Paper Structure

This paper contains 20 sections, 3 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Difference between input-level and feature-level augmentation for respiratory sound classification. Our proposed input-agnostic RepAugment, which consists of Rep-Mask and Rep-Gen (bottom right), can be employed for any input type, whereas SpecAugment (top left) can only be applied to an input spectrogram.
  • Figure 2: t-SNE results of pretrained HuBERT-Large and XLS-R-300M on LibriSpeech and ICBHI test sets.
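
The contrast in Figure 1 can be made concrete with a small sketch. Because the augmentation acts on the representation vector that feeds the classifier, it never needs to know whether the backbone consumed a waveform or a spectrogram. The code below is only an illustration under stated assumptions, not the authors' implementation: this page does not define Rep-Mask or Rep-Gen, so the masking rule (zeroing a random subset of feature dimensions), the generation rule (Gaussian-perturbed copies of minority-class vectors), and every name and default here (`rep_mask`, `rep_gen`, `rep_augment`, `mask_ratio`, `noise_std`) are hypothetical. Plain Python lists are used to keep the example dependency-free.

```python
import random


def rep_mask(rep, mask_ratio=0.2, rng=random):
    """Zero out a random subset of feature dimensions of one representation.

    Assumption: masking means setting the chosen dimensions to 0.0.
    """
    n_mask = int(len(rep) * mask_ratio)
    masked_idx = set(rng.sample(range(len(rep)), n_mask))
    return [0.0 if i in masked_idx else v for i, v in enumerate(rep)]


def rep_gen(reps, labels, minority_classes, noise_std=0.1, rng=random):
    """Create one noisy virtual copy of every minority-class representation.

    Assumption: generation is modeled as additive Gaussian noise.
    """
    new_reps, new_labels = [], []
    for rep, label in zip(reps, labels):
        if label in minority_classes:
            new_reps.append([v + rng.gauss(0.0, noise_std) for v in rep])
            new_labels.append(label)
    return new_reps, new_labels


def rep_augment(reps, labels, minority_classes,
                mask_ratio=0.2, noise_std=0.1, seed=0):
    """Hypothetical combination: generate minority samples, then mask everything."""
    rng = random.Random(seed)
    gen_reps, gen_labels = rep_gen(reps, labels, minority_classes, noise_std, rng)
    all_reps = [rep_mask(r, mask_ratio, rng) for r in reps + gen_reps]
    return all_reps, labels + gen_labels


# Toy 10-dimensional "backbone outputs"; labels are illustrative only.
reps = [[1.0] * 10, [2.0] * 10, [3.0] * 10]
labels = ["normal", "wheeze", "normal"]
aug_reps, aug_labels = rep_augment(reps, labels, minority_classes={"wheeze"})
```

In this toy run the batch grows by one virtual "wheeze" sample, and each vector passed on to the classifier has two of its ten dimensions masked. The same function would apply unchanged to representations from a spectrogram-based or a waveform-based backbone, which is the sense in which such an augmentation is input-agnostic.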