A Multi-Stage Hybrid CNN-Transformer Network for Automated Pediatric Lung Sound Classification

Samiul Based Shuvo; Taufiq Hasan

TL;DR

The paper tackles pediatric lung-sound classification with a multi-stage CNN-Transformer that operates on scalogram representations, motivated by the developmental differences that alter respiratory acoustics in young children. It uses MobileNetV2 for efficient feature extraction, refines those features with transformer-based attention, and feeds them to an MLP classifier trained with a class-weighted focal loss. On the SPRSound pediatric dataset, the approach reports state-of-the-art event-level scores (0.9039 for binary and 0.8448 for multiclass) and recording-level scores (0.720 for ternary and 0.571 for multiclass), surpassing the previous best models by 3.81% and 5.94%. The authors acknowledge limitations in interpretability and data diversity, and suggest explainability and multimodal integration as future directions toward real-world deployment.

Abstract

Automated analysis of lung sound auscultation is essential for monitoring respiratory health, especially in regions facing a shortage of skilled healthcare workers. While respiratory sound classification has been widely studied in adults, its application in pediatric populations, particularly in children aged <6 years, remains an underexplored area. The developmental changes in pediatric lungs considerably alter the acoustic properties of respiratory sounds, necessitating specialized classification approaches tailored to this age group. To address this, we propose a multistage hybrid CNN-Transformer framework that combines CNN-extracted features with an attention-based architecture to classify pediatric respiratory diseases using scalogram images from both full recordings and individual breath events. Our model achieved an overall score of 0.9039 in binary event classification and 0.8448 in multiclass event classification by employing class-wise focal loss to address data imbalance. At the recording level, the model attained scores of 0.720 for ternary and 0.571 for multiclass classification. These scores outperform the previous best models by 3.81% and 5.94%, respectively. This approach offers a promising solution for scalable pediatric respiratory disease diagnosis, especially in resource-limited settings.
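
The abstract names a class-wise focal loss as the mechanism for handling data imbalance but does not spell out its form. The sketch below shows one common formulation, FL = -w_c (1 - p_t)^gamma log(p_t), for a single example. The inverse-frequency weighting, the value gamma = 2.0, and all function names are illustrative assumptions and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def inverse_frequency_weights(class_counts):
    """Hypothetical weighting scheme: rarer classes get larger weights.

    Assumption: weights are total / (num_classes * count); the paper may
    use a different scheme.
    """
    total = sum(class_counts)
    n = len(class_counts)
    return [total / (n * c) for c in class_counts]


def class_weighted_focal_loss(logits, target, class_weights, gamma=2.0):
    """Focal loss for one example, scaled by the weight of the true class.

    The (1 - p_t) ** gamma factor shrinks the loss of well-classified
    examples; gamma = 0 with unit weights reduces to cross-entropy.
    """
    p_t = softmax(logits)[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


# Example: an imbalanced two-class problem with 90 vs. 10 examples.
weights = inverse_frequency_weights([90, 10])
loss_minority = class_weighted_focal_loss([0.0, 0.0], 1, weights)
loss_majority = class_weighted_focal_loss([0.0, 0.0], 0, weights)
```

With equal logits, the minority-class example contributes a larger loss than the majority-class one purely because of its class weight, which is the effect that counteracts imbalance during training.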
