Patch-Mix Contrastive Learning with Audio Spectrogram Transformer on Respiratory Sound Classification

Sangmin Bae, June-Woo Kim, Won-Yang Cho, Hyerim Baek, Soyoun Son, Byungjo Lee, Changwan Ha, Kyongpil Tae, Sungnyun Kim, Se-Young Yun

TL;DR

This work tackles respiratory sound classification under data scarcity by leveraging an Audio Spectrogram Transformer (AST) pretrained on ImageNet and AudioSet. It introduces Patch-Mix augmentation to mix spectrogram patches and a Patch-Mix Contrastive Learning loss that treats mixed latent representations as positive pairs, enabling robust learning despite label hierarchies. The proposed approach achieves state-of-the-art results on the ICBHI dataset, with a 4-class Score of 62.37% and a 2-class Score of 68.71%, surpassing the previous best by 4.08 percentage points. The findings demonstrate that cross-domain pretrained transformers can effectively generalize to medical audio tasks and that latent-space contrastive learning can substantially improve performance with limited medical data.

Abstract

Respiratory sound contains crucial information for the early diagnosis of fatal lung diseases. Since the COVID-19 pandemic, there has been growing interest in contact-free medical care based on electronic stethoscopes. To this end, cutting-edge deep learning models have been developed to diagnose lung diseases; however, the task remains challenging due to the scarcity of medical data. In this study, we demonstrate that a model pretrained on large-scale visual and audio datasets can be generalized to the respiratory sound classification task. In addition, we introduce a straightforward Patch-Mix augmentation, which randomly mixes patches between different samples, with the Audio Spectrogram Transformer (AST). We further propose a novel and effective Patch-Mix Contrastive Learning to distinguish the mixed representations in the latent space. Our method achieves state-of-the-art performance on the ICBHI dataset, outperforming the prior leading score by 4.08%.
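
The Patch-Mix idea described in the abstract (randomly mixing patches between different samples) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `patch_mix`, the arguments `mix_ratio`, `partner`, and `lam`, the uniform sampling of the mixing ratio, and the choice to swap patches at matching positions are all assumptions made for illustration.

```python
import numpy as np

def patch_mix(patches, rng, mix_ratio=None):
    """Hypothetical Patch-Mix sketch on patch embeddings.

    patches: array of shape (batch, num_patches, dim).
    Returns (mixed, partner, lam), where `partner[i]` is the sample whose
    patches were mixed into sample i and `lam` is the fraction of patch
    positions that were NOT replaced.
    """
    batch, num_patches, _ = patches.shape
    # Assumption: the fraction of replaced patches is drawn uniformly.
    if mix_ratio is None:
        mix_ratio = rng.uniform(0.0, 1.0)
    num_swapped = int(round(mix_ratio * num_patches))

    # Pair every sample with a randomly chosen sample from the same batch.
    partner = rng.permutation(batch)

    # For each sample, pick which patch positions are taken from its partner.
    mask = np.zeros((batch, num_patches), dtype=bool)
    for i in range(batch):
        idx = rng.choice(num_patches, size=num_swapped, replace=False)
        mask[i, idx] = True

    # Patches keep their position j, so positional embeddings added
    # afterwards would be unaffected by the mixing.
    mixed = np.where(mask[..., None], patches[partner], patches)
    lam = 1.0 - num_swapped / num_patches
    return mixed, partner, lam

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10, 3))  # 4 spectrograms, 10 patches, 3-dim embeddings
mixed, partner, lam = patch_mix(x, rng, mix_ratio=0.3)
```

In this sketch every mixed patch comes either from the original sample or from its partner at the same position, so `lam` can later serve as a soft weight between the two sources.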

Paper Structure (16 sections, 2 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Overview of Patch-Mix Contrastive Learning. $E_{ij}$ denotes the embedding of the $j$-th patch from the $i$-th spectrogram, and $P_j$ is the $j$-th positional embedding. The class and distill tokens in the AST are omitted. Dashed lines indicate the stop-gradient operation.
  • Figure 2: Ablation study on the hyperparameters of the Patch-Mix contrastive loss. We searched hyperparameters under the default setting: Patch-Mix, ALL negative pairs, and $\texttt{stop}(z)$. The final chosen hyperparameters are highlighted in gray.
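
To make the captions above more concrete, here is a rough NumPy sketch of one plausible form of a contrastive loss over mixed representations. The exact loss is not given in this summary, so everything here is an assumption: a mixup-weighted InfoNCE-style objective, the function name `patch_mix_contrastive_loss`, the temperature `tau`, the mixing weight `lam`, and the use of all other samples in the batch as negatives.

```python
import numpy as np

def patch_mix_contrastive_loss(z_mixed, z_clean, partner, lam, tau=0.1):
    """Hypothetical mixup-weighted InfoNCE-style loss (assumed form).

    z_mixed: (batch, dim) representations of Patch-Mixed inputs.
    z_clean: (batch, dim) representations of the unmixed inputs. In an
             autodiff framework these would be wrapped in a stop-gradient,
             as the stop(z) in the caption suggests; NumPy has no gradients.
    partner: (batch,) index of the sample mixed into each input.
    lam:     weight on the original sample; (1 - lam) goes to the partner.
    """
    # Cosine similarity via L2 normalization.
    z_mixed = z_mixed / np.linalg.norm(z_mixed, axis=1, keepdims=True)
    z_clean = z_clean / np.linalg.norm(z_clean, axis=1, keepdims=True)
    logits = z_mixed @ z_clean.T / tau  # shape (batch, batch)

    # Numerically stable log-softmax over all clean samples in the batch.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    rows = np.arange(len(z_mixed))
    # Two soft positives per row: the original sample and its mixing partner.
    loss = -(lam * log_prob[rows, rows] + (1.0 - lam) * log_prob[rows, partner])
    return float(loss.mean())

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
partner = np.array([1, 0, 3, 2])
loss_self = patch_mix_contrastive_loss(z, z, partner, lam=1.0)
loss_partner = patch_mix_contrastive_loss(z, z, partner, lam=0.0)
```

With identical mixed and clean representations, putting all weight on the original sample (`lam=1.0`) gives a lower loss than putting all weight on the partner (`lam=0.0`), which is the behavior one would expect from such a weighting.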