Predicting COVID-19 Prevalence Using Wastewater RNA Surveillance: A Semi-Supervised Learning Approach with Temporal Feature Trust

Yifei Chen, Eric Liang

TL;DR

This work addresses estimating community COVID-19 prevalence without invasive testing by fusing wastewater RNA surveillance with testing data and environmental covariates. It develops a semi-supervised neural network that predicts true daily cases $C_t$ while employing a time-dependent phase-out parameter $\alpha(t)$ to downweight unreliable features as data quality evolves. The core contribution is a dual-gradient penalty that suppresses RNA and testing-feature gradients in a principled, temporally aligned manner, yielding improvements of about 12.8% in MSE and 15.4% in MAE across 22 states, along with notable state-level gains. Limitations include substantial data gaps and RNA data quality issues; PMMoV normalization is proposed as a future enhancement to further stabilize wastewater signals and improve predictions.

Abstract

As COVID-19 transitions into an endemic disease that remains constantly present in the population at a stable level, monitoring its prevalence without invasive measures becomes increasingly important. In this paper, we present a deep neural network estimator for the COVID-19 daily case count based on wastewater surveillance data and other confounding factors. This work builds upon the study by Jiang, Kolozsvary, and Li (2024), which connects COVID-19 case counts with testing data collected early in the pandemic. Using the COVID-19 testing data and the wastewater surveillance data from the period when both data sources were highly reliable, one can train an artificial neural network that learns the nonlinear relation between the COVID-19 daily case count and the wastewater viral RNA concentration. From a machine learning perspective, the main challenge lies in addressing temporal feature reliability, as the reliability of the training data varies across time periods.
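
The idea of downweighting a feature as its reliability decays can be illustrated with a small sketch. This is not the paper's implementation: the linear ramp in `alpha`, the toy `model`, the finite-difference gradient, and all parameter values are assumptions made purely for illustration, and only the testing-feature side of the penalty is shown (the paper describes a dual penalty that also covers RNA features). The sketch adds a penalty $\alpha(t)\,(\partial \hat{C}_t / \partial x_{\text{test}})^2$ to a squared-error loss, so that the prediction is pushed to depend less on the testing feature once $\alpha(t)$ grows.

```python
import math


def alpha(t, t_start=100.0, t_end=200.0, alpha_max=2.0):
    """Hypothetical phase-out schedule: 0 before t_start, linear ramp to alpha_max at t_end."""
    if t <= t_start:
        return 0.0
    if t >= t_end:
        return alpha_max
    return alpha_max * (t - t_start) / (t_end - t_start)


def model(params, rna, tests):
    """Toy predictor (assumed form): linear in RNA, nonlinear in the testing feature."""
    w_rna, w_tests, bias = params
    return w_rna * rna + math.tanh(w_tests * tests) + bias


def input_gradient(f, x, eps=1e-5):
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


def penalized_loss(params, samples):
    """Mean of squared error plus alpha(t) times the squared testing-feature gradient.

    samples: list of (t, rna, tests, target) tuples.
    """
    total = 0.0
    for t, rna, tests, target in samples:
        pred = model(params, rna, tests)
        # Sensitivity of the prediction to the testing feature at this sample.
        grad_tests = input_gradient(lambda v: model(params, rna, v), tests)
        total += (pred - target) ** 2 + alpha(t) * grad_tests ** 2
    return total / len(samples)


if __name__ == "__main__":
    params = (1.0, 0.5, 0.0)
    early = [(50.0, 2.0, 0.0, 2.0)]   # alpha = 0: no penalty applied
    late = [(250.0, 2.0, 0.0, 2.0)]   # alpha = alpha_max: gradient is penalized
    print(penalized_loss(params, early), penalized_loss(params, late))
```

With a schedule like this, samples from the early, reliable period contribute only their fitting error, while later samples also pay for any remaining sensitivity to the testing feature. In a real training loop the input gradient would come from automatic differentiation rather than finite differences.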

Paper Structure

This paper contains 34 sections, 15 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Alpha function $\alpha(t)$.
  • Figure 1: Gradient distribution analysis showing the suppression of testing-related feature gradients as $\alpha$ increases. The mechanism effectively reduces gradient magnitudes when $\alpha > 1$.
  • Figure 1: IFR data from 29 Feb. 2020 to 1 Mar. 2023 for the states MA, CA, MN, and TX, and the baseline IFR.
  • Figure 2: Toy model: gradient distribution of $dy/dx$ with respect to $\alpha$.
  • Figure 2: Massachusetts predictions comparison.
  • ...and 7 more figures