Multi-View Spectrogram Transformer for Respiratory Sound Classification

Wentao He, Yuchen Yan, Jianfeng Ren, Ruibin Bai, Xudong Jiang

TL;DR

This paper tackles automatic respiratory sound classification by addressing the mismatch between spectrograms and natural images. It introduces the Multi-View Spectrogram Transformer (MVST), which renders multiple time-frequency views of a mel-spectrogram through differently shaped patches, processes each view with a dedicated Transformer encoder, and fuses the views via a gated mechanism. The main contributions are: (1) multi-view patch splitting to capture diverse spectral characteristics, (2) view-specific Transformer encoders to extract attentional patterns, and (3) a gated fusion strategy that automatically emphasizes the most discriminative spectral view. Evaluated on the ICBHI dataset, MVST achieves state-of-the-art performance across specificity, sensitivity, and average score, indicating practical gains for automated auscultation and lung-disease screening.

Abstract

Deep neural networks have been applied to audio spectrograms for respiratory sound classification. Existing models often treat the spectrogram as a synthetic image while overlooking its physical characteristics. In this paper, a Multi-View Spectrogram Transformer (MVST) is proposed to embed different views of time-frequency characteristics into the vision transformer. Specifically, the proposed MVST splits the mel-spectrogram into different-sized patches, which represent the multi-view acoustic elements of a respiratory sound. These patches, together with positional embeddings, are then fed into transformer encoders to extract the attentional information among patches through a self-attention mechanism. Finally, a gated fusion scheme is designed to automatically weight the multi-view features so that the most suitable view is highlighted in a specific scenario. Experimental results on the ICBHI dataset demonstrate that the proposed MVST significantly outperforms state-of-the-art methods for classifying respiratory sounds.
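
The two steps the abstract describes around the transformer encoders, multi-view patch splitting and gated fusion, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the patch shapes, the function names `split_into_patches` and `gated_fusion`, and the use of a softmax over per-view gate scores are all assumptions made for illustration, and the transformer encoders themselves are omitted (the per-view feature vectors are simply passed in).

```python
import math


def split_into_patches(spec, patch_h, patch_w):
    """Split a 2-D spectrogram (rows = frequency bins, columns = time frames)
    into non-overlapping patch_h x patch_w patches, each flattened row by row.
    Rows or columns that do not fill a whole patch are dropped."""
    n_freq, n_time = len(spec), len(spec[0])
    patches = []
    for f0 in range(0, n_freq - patch_h + 1, patch_h):
        for t0 in range(0, n_time - patch_w + 1, patch_w):
            patch = [
                spec[f][t]
                for f in range(f0, f0 + patch_h)
                for t in range(t0, t0 + patch_w)
            ]
            patches.append(patch)
    return patches


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(view_features, gate_scores):
    """Fuse per-view feature vectors (all of the same length) as a weighted
    sum. The weights come from a softmax over gate_scores, which is an
    assumed form of gating; in a trained model the scores would be learned."""
    weights = softmax(gate_scores)
    dim = len(view_features[0])
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, view_features))
        for i in range(dim)
    ]
    return fused, weights


# Toy 8 x 8 "spectrogram" whose entries encode their own position.
spec = [[f * 8 + t for t in range(8)] for f in range(8)]

# Hypothetical views: square, frequency-elongated, and time-elongated patches.
views = {"square": (4, 4), "tall": (8, 2), "wide": (2, 8)}
view_patches = {
    name: split_into_patches(spec, h, w) for name, (h, w) in views.items()
}

# Placeholder per-view features standing in for the encoder outputs.
fused, weights = gated_fusion([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

In this toy setup every view yields four patches of 16 values each, but the patches cover differently shaped time-frequency regions, which is the sense in which each view emphasizes different spectral characteristics. With equal gate scores the fusion reduces to a plain average of the view features; unequal scores shift the weight toward one view.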

Paper Structure

This paper contains 11 sections, 5 equations, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Overview of the proposed Multi-View Spectrogram Transformer (MVST). The mel-spectrogram of an input audio clip is split into different-sized patches to analyze the multi-view spectral characteristics. The multi-view spectrogram patches are then processed by multi-scale transformer encoders to capture the attentional information among patches. A gated fusion mechanism is then designed to highlight the most suitable spectral view for respiratory sound classification.
  • Figure 2: Two example spectrograms sliced from the same crackling & breathing cycle at different time intervals.