A Multi-Stage Hybrid CNN-Transformer Network for Automated Pediatric Lung Sound Classification

Samiul Based Shuvo, Taufiq Hasan

TL;DR

The paper tackles pediatric lung-sound classification with a multi-stage CNN-Transformer that uses scalogram representations to accommodate developmental differences in young children. It combines MobileNetV2 for efficient feature extraction with transformer-based attention to refine features, followed by an MLP classifier trained with a class-weighted focal loss. On the SPRSound pediatric dataset, the approach achieves state-of-the-art event-level scores (0.9039 for binary and 0.8448 for multiclass) and recording-level scores (0.720 for ternary and 0.571 for multiclass), surpassing prior methods by up to 3.81% and 5.94%. The authors acknowledge limitations in interpretability and data diversity, and suggest future work on explainability and multimodal integration for real-world deployment.

Abstract

Automated analysis of lung sound auscultation is essential for monitoring respiratory health, especially in regions facing a shortage of skilled healthcare workers. While respiratory sound classification has been widely studied in adults, its application in pediatric populations, particularly in children aged <6 years, remains an underexplored area. The developmental changes in pediatric lungs considerably alter the acoustic properties of respiratory sounds, necessitating specialized classification approaches tailored to this age group. To address this, we propose a multistage hybrid CNN-Transformer framework that combines CNN-extracted features with an attention-based architecture to classify pediatric respiratory diseases using scalogram images from both full recordings and individual breath events. Our model achieved an overall score of 0.9039 in binary event classification and 0.8448 in multiclass event classification by employing class-wise focal loss to address data imbalance. At the recording level, the model attained scores of 0.720 for ternary and 0.571 for multiclass classification. These scores outperform the previous best models by 3.81% and 5.94%, respectively. This approach offers a promising solution for scalable pediatric respiratory disease diagnosis, especially in resource-limited settings.
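
The abstract names a class-wise focal loss as the remedy for class imbalance but gives no formula here. The following is a minimal NumPy sketch of the standard class-weighted focal loss, FL = -alpha_c * (1 - p_t)^gamma * log(p_t), not the authors' implementation. The function names, the `gamma=2.0` default, and the example weights are assumptions chosen for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def class_weighted_focal_loss(logits, labels, class_weights, gamma=2.0):
    """Mean focal loss with one weight per class (hypothetical helper).

    logits:        (n_samples, n_classes) raw scores
    labels:        (n_samples,) integer class indices
    class_weights: (n_classes,) per-class weights, e.g. larger for rare classes
    gamma:         focusing parameter; gamma=0 gives weighted cross-entropy
    """
    probs = softmax(logits)
    # Probability assigned to the true class of each sample.
    p_t = probs[np.arange(len(labels)), labels]
    alpha_t = class_weights[labels]
    # (1 - p_t)^gamma down-weights samples that are already classified well.
    loss = -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)
    return loss.mean()

# Illustrative values only: class 1 is treated as the rare class.
example_logits = np.array([[2.0, 0.0], [0.5, 1.0], [3.0, -1.0]])
example_labels = np.array([0, 1, 0])
example_weights = np.array([0.5, 2.0])
print(class_weighted_focal_loss(example_logits, example_labels, example_weights))
```

With `gamma=0` and unit weights the function reduces to ordinary cross-entropy, which is a convenient sanity check when tuning the focusing parameter.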

Paper Structure

This paper contains 23 sections, 10 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: A graphical overview of the proposed framework for lung sound analysis. The framework begins with the SPRSound dataset, followed by several pre-processing steps, including low-pass filtering, resampling, and normalization. The lung sound signals are then segmented based on labels and padded to a fixed length. These transformed signals are then fed into the proposed model, which performs the classification tasks.
  • Figure 2: Architecture of the proposed Multi-Stage Hybrid CNN-Transformer model for respiratory sound classification. The model consists of a MobileNet feature extractor, followed by global average pooling, an embedding layer, and a series of Transformer modules (8 attention heads with a hidden size of 2048). The final classification is performed by a multi-layer perceptron (MLP) to predict the diagnostic classes.
  • Figure 3: Two t-SNE plots of feature embeddings: one for the MobileNetV2 model (left) and one for the proposed model (right), in which transformer blocks refine the extracted features.
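
The Figure 1 caption describes normalizing each recording, segmenting it by event labels, and padding the segments to a fixed length. The sketch below illustrates only those three steps in NumPy; low-pass filtering, resampling, and the scalogram transform are omitted. Peak normalization, zero padding, truncation of overlong events, and all function and parameter names are assumptions, since the caption does not specify them.

```python
import numpy as np

def normalize(signal):
    """Scale a signal to a peak amplitude of 1 (assumed normalization scheme)."""
    signal = np.asarray(signal, dtype=float)
    peak = np.max(np.abs(signal)) if signal.size else 0.0
    return signal / peak if peak > 0 else signal

def segment_and_pad(signal, sample_rate, events, target_len):
    """Cut labeled events out of a recording and zero-pad each to target_len.

    signal:      1-D array of audio samples
    sample_rate: samples per second
    events:      list of (start_seconds, end_seconds) pairs from the annotations
    target_len:  fixed output length in samples; longer events are truncated
    """
    signal = normalize(signal)
    if not events:
        return np.empty((0, target_len))
    segments = []
    for start_s, end_s in events:
        start = int(round(start_s * sample_rate))
        end = int(round(end_s * sample_rate))
        seg = signal[start:end][:target_len]
        padded = np.zeros(target_len)
        padded[:len(seg)] = seg  # remaining samples stay zero
        segments.append(padded)
    return np.stack(segments)

# Illustrative input: a 1-second ramp sampled at 10 Hz with two labeled events.
demo = segment_and_pad(np.arange(10), sample_rate=10,
                       events=[(0.0, 0.3), (0.5, 1.0)], target_len=4)
print(demo.shape)
```

Each row of the returned array is one fixed-length event segment, which is the shape a downstream scalogram step would typically expect.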