Improving Epidemic Analyses with Privacy-Preserving Integration of Sensitive Data

Zihan Guan, Zhiyuan Zhao, Fengwei Tian, Dung Nguyen, Payel Bhattacharjee, Ravi Tandon, B. Aditya Prakash, Anil Vullikanti

Abstract

Epidemic analyses increasingly rely on heterogeneous datasets, many of which are sensitive and require strong privacy protection. Although differential privacy (DP) has become a standard in machine learning and data sharing, its adoption in epidemiological modeling remains limited. In this work, we introduce DPEpiNN, a unified framework that integrates deep neural networks with a mechanistic SEIRM-based metapopulation model under formal DP guarantees. DPEpiNN supports multiple epidemic tasks (including multi-step forecasting, nowcasting, effective reproduction number ($R_t$) estimation, and intervention analysis) within a single differentiable pipeline. The framework jointly learns epidemic parameters from heterogeneous public and sensitive datasets, while ensuring privacy via input perturbation mechanisms. We evaluate DPEpiNN using COVID-19 data from three regions. Results show that incorporating sensitive datasets substantially improves predictive performance even under strong privacy constraints. Compared with a deep learning baseline, DPEpiNN achieves higher accuracy in forecasting and nowcasting while producing reliable estimates of $R_t$. Furthermore, the learned epidemic transmission models remain inherently private due to the post-processing property of differential privacy, enabling downstream policy analyses such as simulation of social distancing interventions. Our work demonstrates that interpretability (through mechanistic modeling), predictive accuracy (through neural integration), and rigorous privacy guarantees can be jointly achieved in modern epidemic modeling.

Paper Structure

This paper contains 27 sections, 15 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview of epidemic analyses supported by DPEpiNN. (a) The goal of forecasting is to predict the targets for the next $H$ days given the time series observed at current time $T$; (b) Nowcasting revises the currently observed time series to its stable version; (c) The nowcast is then used to estimate the effective reproduction number $R_t$; (d) The meta-population module takes as input a public contact matrix $C$ across age groups, a public population vector $\beta$ stratified by age, and the predicted epidemic parameters. Using the age-stratified pandemic transmission equations, the meta-population model simulates and aggregates the daily new infection count (i.e., the prediction target) for each time stamp $t$. (e) DPEpiNN consists of three modules: parameterNN, Meta-population Model, and Error Correction Adapter. Datasets are processed by the parameterNN modules to generate epidemic parameters using encoders of varying granularities (Step 1). These parameters are then fed into the meta-population model to generate pandemic simulations (Step 2). An error correction adapter then adaptively corrects errors in the simulations (Step 3). After loss computation (Step 4), the gradients are back-propagated to the parameterNN to update the network (Step 5). The input module privatizes the sensitive time series for a given privacy budget $(\epsilon, \delta)$ using an input perturbation method (e.g., randomized response or the Laplace mechanism).
  • Figure 2: Forecasting performance. a. Multi-step forecasting performance on the Bogotá, Medellín, and USA datasets. The privacy budget is $\epsilon=1$. b. Visualizations of single-step forecasting on the Bogotá dataset. Different colors indicate different settings: red denotes forecasting without sensitive data; green denotes forecasting with sensitive data under no privacy protection; purple denotes forecasting with sensitive data privatized using the Laplace mechanism; and orange denotes forecasting with sensitive data privatized using the randomized response (RR) mechanism. c. Visualizations of multi-step forecasting, where the training period is extended from 50 weeks to 58 weeks and the testing period is fixed at 4 weeks (28 days). Red and green denote forecasts from the meta-population module and the error correction adapter, respectively. Purple denotes forecasts from the LSTM model.
  • Figure 3: Nowcasting performance. a. The left panel shows the single-step nowcasting performance on the Bogotá and USA datasets. The right panel shows the multi-step nowcasting performance on the same two datasets. b. Visualizations of single-step nowcasting results. Different colors indicate different settings: red denotes the real-time input; green denotes nowcasting with sensitive data under no privacy protection; purple denotes nowcasting with sensitive data privatized using the RR mechanism; and orange denotes nowcasting with sensitive data privatized using the Laplace mechanism. c. Visualizations of multi-step nowcasting results. The model is trained on a 40-week period and then continuously deployed using shifting window sizes of 41, 42, and 43 weeks. The color coding is the same as in panel b.
  • Figure 4: Time-varying $R_t$ estimation by EpiEstim with three inputs: (1) corrected (stable) daily reports of new cases; (2) daily new cases nowcasted by DPEpiNN under the public setting; and (3) daily new cases nowcasted by DPEpiNN under the private setting.
  • Figure 5: Forecasting using the LSTM technique: we use a multi-encoder structure, which allows different types of datasets (whose shapes are not always consistent) to be incorporated. Transaction data is incorporated through a separate encoder. Predictions are generated directly by the LSTM in an autoregressive manner.
  • ...and 2 more figures
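
The input perturbation step named in the Figure 1 caption (randomized response and the Laplace mechanism) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `laplace_privatize` and `randomized_response`, the `sensitivity` parameter (taken here to be the L1 sensitivity of the whole series to one individual's data), and the binary form of randomized response are all hypothetical choices made for the example. Both mechanisms as written give pure $\epsilon$-DP, i.e., $\delta = 0$.

```python
import numpy as np


def laplace_privatize(series, epsilon, sensitivity, rng=None):
    """Add Laplace noise with scale sensitivity / epsilon to every entry.

    `sensitivity` is assumed to be the L1 sensitivity of the full series
    with respect to a change in one individual's data (an assumption of
    this sketch; the paper's calibration is not shown in this listing).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = np.random.default_rng() if rng is None else rng
    series = np.asarray(series, dtype=float)
    scale = sensitivity / epsilon
    noise = rng.laplace(loc=0.0, scale=scale, size=series.shape)
    return series + noise


def randomized_response(bits, epsilon, rng=None):
    """Binary randomized response: keep each bit with probability
    e^eps / (1 + e^eps), otherwise flip it."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = np.random.default_rng() if rng is None else rng
    bits = np.asarray(bits, dtype=int)
    # Numerically stable form of e^eps / (1 + e^eps).
    p_keep = 1.0 / (1.0 + np.exp(-epsilon))
    keep = rng.random(bits.shape) < p_keep
    return np.where(keep, bits, 1 - bits)


# Usage on a toy (made-up) series with privacy budget epsilon = 1.
rng = np.random.default_rng(0)
toy_counts = np.array([12.0, 15.0, 9.0, 20.0])
noisy_counts = laplace_privatize(toy_counts, epsilon=1.0, sensitivity=1.0, rng=rng)
noisy_bits = randomized_response(np.array([0, 1, 1, 0]), epsilon=1.0, rng=rng)
```

Because the noise is applied to the inputs before training, anything computed from the privatized series afterwards inherits the same guarantee through the post-processing property mentioned in the abstract.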

Theorems & Definitions (1)

  • Definition 1: $(\epsilon, \delta)$-Label Differential Privacy
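
For orientation, the usual formulation of label differential privacy from the DP literature is given below. The paper's own statement of Definition 1 is not reproduced in this listing and may differ in detail; the symbols $\mathcal{A}$, $D$, $D'$, and $S$ are notation assumed for this sketch. A randomized algorithm $\mathcal{A}$ satisfies $(\epsilon, \delta)$-label DP if, for any two datasets $D$ and $D'$ that differ only in the label of a single example, and for any set $S$ of possible outputs,

$$
\Pr[\mathcal{A}(D) \in S] \le e^{\epsilon} \, \Pr[\mathcal{A}(D') \in S] + \delta .
$$

The features of every example are treated as public; only the labels are protected. Smaller $\epsilon$ and $\delta$ correspond to a stronger guarantee.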