Infant Cry Detection Using Causal Temporal Representation

Minghao Fu, Danning Li, Aryan Gadhiya, Benjamin Lambright, Mohamed Alowais, Mohab Bahnassy, Saad El Dine Elletter, Hawau Olamide Toyin, Haiyan Jiang, Kun Zhang, Hanan Aldarmaki

TL;DR

The paper tackles robust infant cry detection under real-world noise and limited fine-grained labels by introducing an annotated cry-segmentation dataset and an unsupervised method based on causal temporal representation (CRSTC). The unsupervised framework combines a Sparse Transition VAE (ST-VAE) with temporal clustering to identify latent temporal patterns and align them with cry segments, while a theoretical identifiability guarantee supports domain-variable recovery under specific conditions. Empirically, CRSTC achieves competitive results against supervised baselines and improves downstream cry classification across datasets, highlighting the practical impact for infant care. Overall, the work advances both data resources and unsupervised learning techniques to enhance real-world infant cry detection and interpretation.
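As an illustration of what aligning temporal clusters with cry segments could look like in practice, the sketch below merges runs of identical frame-level cluster labels into timed segments. It is not the paper's implementation: the function name `frames_to_segments`, the `hop_seconds` parameter, and the assumption that clustering yields one integer label per fixed-length frame are all hypothetical.

```python
from itertools import groupby


def frames_to_segments(frame_labels, hop_seconds):
    """Merge consecutive identical frame labels into (start_s, end_s, label) tuples.

    Assumes (hypothetically) one cluster label per frame and a fixed
    frame hop of `hop_seconds` seconds.
    """
    segments = []
    start = 0  # index of the first frame of the current run
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)
        segments.append((start * hop_seconds, (start + length) * hop_seconds, label))
        start += length
    return segments


if __name__ == "__main__":
    # Hypothetical labels: 1 = cluster aligned with crying, 0 = everything else.
    labels = [0, 0, 1, 1, 1, 0]
    for seg in frames_to_segments(labels, hop_seconds=0.5):
        print(seg)
```

Which cluster corresponds to crying is not decided by this step; in an unsupervised setting that mapping has to come from a separate alignment rule or a small amount of labeled data.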

Abstract

This paper addresses a major challenge in acoustic event detection, in particular infant cry detection in the presence of other sounds and background noises: the lack of precise annotated data. We present two contributions for supervised and unsupervised infant cry detection. The first is an annotated dataset for cry segmentation, which enables supervised models to achieve state-of-the-art performance. Additionally, we propose a novel unsupervised method, Causal Representation Sparse Transition Clustering (CRSTC), based on causal temporal representation, which helps address the issue of data scarcity more generally. By integrating the detected cry segments, we significantly improve the performance of downstream infant cry classification, highlighting the potential of this approach for infant care applications.

Paper Structure

This paper contains 12 sections, 1 theorem, 2 equations, 3 figures, 2 tables.

Key Result

Theorem 1

(Identifiability of Domain Variables (directly from Thm. 1 in Song2024Causal)) Suppose that the dataset $\mathcal{D}$ is generated from a nonstationary data generation process. Suppose the following assumptions hold: Then the domain variables $u_t$ are identifiable up to label swapping.
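"Identifiable up to label swapping" means the recovered domain variables can match the true ones only after some renaming of the domain indices, so any comparison between estimated and reference labels should be invariant to that renaming. The sketch below shows one simple way to score such agreement by brute-force search over relabelings. The function name `best_relabeling_accuracy` is hypothetical, and the exhaustive search is an assumption suited to a small number of domains, not a procedure taken from the paper.

```python
from itertools import permutations


def best_relabeling_accuracy(true_labels, pred_labels):
    """Return (accuracy, mapping) for the best renaming of predicted labels.

    Tries every permutation of the label set, so it is only practical
    when the number of distinct domains is small.
    """
    if len(true_labels) != len(pred_labels) or not true_labels:
        raise ValueError("label sequences must be non-empty and of equal length")
    labels = sorted(set(true_labels) | set(pred_labels))
    best_acc, best_map = -1.0, None
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))  # predicted label -> renamed label
        hits = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        acc = hits / len(true_labels)
        if acc > best_acc:
            best_acc, best_map = acc, mapping
    return best_acc, best_map


if __name__ == "__main__":
    true_u = [0, 0, 1, 1, 0, 1]
    pred_u = [1, 1, 0, 0, 1, 0]  # same partition of time steps, indices swapped
    print(best_relabeling_accuracy(true_u, pred_u))
```

In the usage example the two sequences split the time steps identically and differ only in which index names which domain, so the best relabeling recovers full agreement. For many domains, an assignment solver would be a more efficient replacement for the exhaustive search.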

Figures (3)

  • Figure 1: Illustration of graphical model for nonstationary temporal data with unobserved domain variables $u_t$
  • Figure 2: Sparse transition variational autoencoder with temporal clustering
  • Figure 3: Comparison of Supervised (top) and Unsupervised (bottom) infant cry detection model predictions.

Theorems & Definitions (1)

  • Theorem 1