InfantCryNet: A Data-driven Framework for Intelligent Analysis of Infant Cries

Mengze Hong, Chen Jason Zhang, Lingxiao Yang, Yuanfeng Song, Di Jiang

TL;DR

InfantCryNet addresses the challenge of interpreting infant cries under background noise and data scarcity by building on pre-trained audio models and improved pooling strategies. The framework uses CNN-10 for cry detection and CNN-14 for six-way cry analysis, both augmented with statistical pooling and multi-head attention pooling, and applies knowledge distillation and dynamic quantization to make the models practical for mobile deployment. Experiments show a 4.4% improvement in classification accuracy over state-of-the-art baselines. Model compression reduces model size by 7% with no loss in performance, or by up to 28% at the cost of an 8% decrease in accuracy. The work offers a practical path toward on-device infant cry monitoring and suggests future directions such as federated learning to address data limitations.
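To make the pooling idea concrete, here is a minimal sketch of statistical pooling. This summary does not give the paper's exact formulation, so the sketch assumes the common form: the per-dimension mean and standard deviation over time frames are concatenated into one fixed-size vector. The function name `statistics_pooling` and the toy input are hypothetical and not taken from the paper.

```python
import math

def statistics_pooling(frames):
    """Pool a variable-length sequence of frame features into one fixed-size vector.

    frames: list of T frames, each a list of D floats (toy input, assumed layout).
    Returns 2*D floats: per-dimension means followed by per-dimension
    population standard deviations.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    # Mean of each feature dimension across all time frames.
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    # Population standard deviation of each feature dimension.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds

pooled = statistics_pooling([[1.0, 2.0], [3.0, 2.0]])
print(pooled)  # [2.0, 2.0, 1.0, 0.0]
```

Whatever the clip length, the output has a fixed size of 2*D, so a classifier head can consume recordings of different durations. The standard deviation term retains some information about how the features vary over time, which mean pooling alone discards.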

Abstract

Understanding the meaning of infant cries is a significant challenge for young parents in caring for their newborns. The presence of background noise and the lack of labeled data present practical challenges in developing systems that can detect crying and analyze its underlying reasons. In this paper, we present a novel data-driven framework, "InfantCryNet," for accomplishing these tasks. To address the issue of data scarcity, we employ pre-trained audio models to incorporate prior knowledge into our model. We propose the use of statistical pooling and multi-head attention pooling techniques to extract features more effectively. Additionally, knowledge distillation and model quantization are applied to enhance model efficiency and reduce the model size, better supporting industrial deployment in mobile devices. Experiments on real-life datasets demonstrate the superior performance of the proposed framework, outperforming state-of-the-art baselines by 4.4% in classification accuracy. The model compression effectively reduces the model size by 7% without compromising performance and by up to 28% with only an 8% decrease in accuracy, offering practical insights for model selection and system design.
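The knowledge distillation step mentioned in the abstract can be illustrated with a small sketch. The paper's exact loss is not stated in this summary, so the code assumes the widely used formulation: a temperature-softened KL divergence between teacher and student outputs, blended with the cross-entropy on the true label. The names `distillation_loss`, `temperature`, and `alpha`, and their default values, are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target KL term with hard-label cross-entropy (assumed form)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    # Ordinary cross-entropy of the student against the ground-truth class.
    hard = -math.log(softmax(student_logits)[label])
    # The T^2 factor keeps the soft-term gradient scale comparable across temperatures.
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * hard

# Six logits, one per class in a six-way cry classifier (toy values).
teacher = [5.0, 0.0, 0.0, 0.0, 0.0, 0.0]
student = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(distillation_loss(student, teacher, label=0))
```

When the student's logits match the teacher's, the KL term vanishes and only the hard-label term remains. The smaller student is therefore pushed toward both the teacher's soft predictions and the true labels.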

Paper Structure

This paper contains 19 sections, 3 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: Infant cry (left) vs. adult voice (right) in waveform and spectrogram
  • Figure 2: Model architecture: (a) CNN10 for infant cry detection, (b) CNN14 for infant cry classification, (c) knowledge distillation for model compression.