
Differentially private federated learning for localized control of infectious disease dynamics

Raouf Kerkouche, Henrik Zunker, Mario Fritz, Martin J. Kühn

TL;DR

This study proposes a privacy-preserving forecasting method to assist public health experts and decision makers. It follows a localized strategy based on the German counties and communities managed by their local health authorities (LHAs).

Abstract

In times of epidemics, swift reaction is necessary to mitigate epidemic spreading. Localized approaches have several advantages here: they limit the necessary resources and reduce the impact of interventions on a larger scale. However, training a separate machine learning (ML) model on a local scale is often not feasible because too little data is available. Centralizing the data is also challenging because of its high sensitivity and the associated privacy constraints. In this study, we consider a localized strategy based on the German counties and communities managed by their local health authorities (LHAs). So that privacy preservation does not stand in the way of detailed situational data, we propose a privacy-preserving forecasting method that can assist public health experts and decision makers. Federated learning (FL) trains a shared ML model without centralizing raw data. Treating the counties, communities or LHAs as clients and seeking a balance between utility and privacy, we study an FL framework with client-level differential privacy (DP). We train a shared multilayer perceptron on sliding windows of recent case counts to forecast the number of cases, while clients exchange only norm-clipped updates and the server aggregates the updates with DP noise. We evaluate the approach on county-level COVID-19 data during two phases. As expected, very strict privacy yields unstable, unusable forecasts. At a moderately strong privacy level, the DP model closely approaches the non-DP model: $R^2$ around 0.94 (vs. 0.95) and a mean absolute percentage error (MAPE) of 26 % in November 2020; $R^2$ around 0.88 (vs. 0.93) and a MAPE of 21 % in March 2022. Overall, client-level DP-FL can deliver useful county-level predictions with strong privacy guarantees, and the viable privacy budget depends on the epidemic phase. This allows privacy-compliant collaboration among health authorities for local forecasting.
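
The two mechanical steps named in the abstract, sliding-window inputs and noisy aggregation of norm-clipped client updates, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names (`make_windows`, `clip_update`, `dp_aggregate`), the window length, the clipping norm, and the noise multiplier are all assumptions chosen for the example.

```python
import numpy as np

def make_windows(series, window):
    """Turn a 1-D case-count series into (inputs, targets) pairs.

    Each input row holds `window` consecutive days; the target is the next day.
    The window length is a hypothetical parameter, not a value from the paper.
    """
    series = np.asarray(series, dtype=float)
    inputs = np.array([series[i:i + window] for i in range(len(series) - window)])
    targets = series[window:]
    return inputs, targets

def clip_update(update, clip_norm):
    """Scale a client's model update so that its L2 norm is at most clip_norm."""
    norm = np.linalg.norm(update)
    scale = min(1.0, clip_norm / (norm + 1e-12))
    return update * scale

def dp_aggregate(client_updates, clip_norm, noise_multiplier, rng):
    """Sum clipped client updates, add Gaussian noise, and average.

    The noise standard deviation is noise_multiplier * clip_norm, i.e. it is
    calibrated to the largest change a single client can cause in the sum.
    """
    clipped = [clip_update(u, clip_norm) for u in client_updates]
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(client_updates)

# Toy usage: three clients with updates of very different magnitude.
rng = np.random.default_rng(0)
updates = [rng.normal(size=8) * s for s in (0.1, 1.0, 10.0)]
noisy_mean = dp_aggregate(updates, clip_norm=1.0, noise_multiplier=1.0, rng=rng)

X, y = make_windows([1, 2, 3, 4, 5], window=3)
```

Clipping bounds each client's influence on the aggregate, which is what lets a fixed noise scale hide the presence or absence of any single client.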

Paper Structure

This paper contains 29 sections, 4 theorems, 11 equations, 9 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

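Let $\mathrm{SG}_{q,\sigma}$ be the Sampled Gaussian mechanism for some function $f$, under the assumption $\Delta_2 f \leq 1$ for any adjacent $D, D' \in \mathcal{E}$. Then $\mathrm{SG}_{q,\sigma}$ satisfies $(\alpha,\rho)$-RDP if

$$\rho \leq \frac{1}{\alpha-1} \log \max\left(A_{\alpha}, B_{\alpha}\right),$$

where $A_{\alpha} \overset{\Delta}{=} \mathbb{E}_{z\sim \mu_0} \left[\left( \mu(z)/\mu_0(z)\right)^\alpha\right]$ and $B_{\alpha} \overset{\Delta}{=} \mathbb{E}_{z\sim \mu} \left[\left( \mu_0(z)/\mu(z)\right)^\alpha\right]$, with $\mu_0 \overset{\Delta}{=} \mathcal{N}(0,\sigma^2)$, $\mu_1 \overset{\Delta}{=} \mathcal{N}(1,\sigma^2)$, and $\mu \overset{\Delta}{=} (1-q)\mu_0 + q\mu_1$.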
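
To make the quantities in this theorem concrete, the sketch below evaluates the RDP value $\frac{1}{\alpha-1}\log A_\alpha$ of the Sampled Gaussian mechanism at integer orders, using the known closed form $A_\alpha = \sum_{k=0}^{\alpha} \binom{\alpha}{k}(1-q)^{\alpha-k} q^k \exp\left((k^2-k)/(2\sigma^2)\right)$ from the Sampled Gaussian mechanism literature. It relies on the result that $A_\alpha$ dominates $B_\alpha$, so only $A_\alpha$ is computed. The conversion to $(\varepsilon,\delta)$-DP uses the classic bound $\varepsilon = \rho + \log(1/\delta)/(\alpha-1)$, which is an assumption made for simplicity and may be looser than the conversion referenced as Theorem 3. All function names and parameter values are hypothetical.

```python
import math

def sgm_rdp_integer_order(q, sigma, alpha):
    """RDP value of the Sampled Gaussian mechanism at an integer order alpha >= 2.

    q is the sampling rate, sigma the noise multiplier (sensitivity assumed 1).
    The sum is evaluated in log space to avoid overflow for large alpha.
    """
    if not (isinstance(alpha, int) and alpha >= 2):
        raise ValueError("alpha must be an integer >= 2")
    if q == 0.0:
        return 0.0
    if q == 1.0:
        # No subsampling: plain Gaussian mechanism.
        return alpha / (2.0 * sigma ** 2)
    log_terms = []
    for k in range(alpha + 1):
        log_binom = (math.lgamma(alpha + 1) - math.lgamma(k + 1)
                     - math.lgamma(alpha - k + 1))
        log_terms.append(log_binom
                         + k * math.log(q)
                         + (alpha - k) * math.log1p(-q)
                         + (k * k - k) / (2.0 * sigma ** 2))
    peak = max(log_terms)
    log_a = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return log_a / (alpha - 1)

def epsilon_after_rounds(q, sigma, rounds, delta, orders=range(2, 65)):
    """Compose `rounds` identical steps additively in RDP, then convert to DP.

    Uses the classic conversion eps = rho + log(1/delta) / (alpha - 1) and
    returns the smallest eps over the candidate orders.
    """
    best = math.inf
    for alpha in orders:
        rho_total = rounds * sgm_rdp_integer_order(q, sigma, alpha)
        best = min(best, rho_total + math.log(1.0 / delta) / (alpha - 1))
    return best

# Hypothetical setting: 10 % of clients sampled per round, 100 rounds.
eps_low_noise = epsilon_after_rounds(q=0.1, sigma=1.0, rounds=100, delta=1e-5)
eps_high_noise = epsilon_after_rounds(q=0.1, sigma=2.0, rounds=100, delta=1e-5)
```

In this sketch, a larger noise multiplier or a smaller sampling rate lowers the per-round RDP value and therefore the final $\varepsilon$, while more rounds raise it.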

Figures (9)

  • Figure 1: Distribution of population sizes at community and county level. Panel A shows the distribution of county population sizes across all 400 German counties, with a median population of 147,524 and a mean population of 200,525. Panel B displays the distribution of community population sizes across the 10,786 communities included in our analysis, with a substantially smaller median population of 1,832 and a mean of 7,827. Red and orange vertical lines indicate median and mean values, respectively.
  • Figure 2: Spatial resolution of German counties and communities and community-based case data. Panel A presents two maps shown side by side. The left map presents North Rhine-Westphalia (NRW) stratified into counties (thick black boundaries) and communities (thin gray boundaries), highlighting county "Coesfeld" in red. The right map displays the stratification of Germany into federal states (thick black boundaries) and counties (thin gray boundaries), highlighting the federal state NRW in orange. Panel B shows the obtained community-based dataset for "Coesfeld" for March 2022, with the county-level aggregate shown in red and individual community trajectories in various colors. The left map from Panel A is using geodata "Verwaltungsgebiete 1:250 000 (VG250)" from BKG (2026) dl-de/by-2-0, Data sources: https://sgx.geodatenzentrum.de/web_public/gdz/datenquellen/datenquellen_vg_nuts.pdf.
  • Figure 3: Communities with zero cases over the input horizon and on the prediction day. Panel A shows the number of communities with 0 to 11 zero entries in the considered time series for November 2020, and Panel B shows the community population sizes for each number of zero entries. Panels C and D report the same structure for March 2022.
  • Figure 4: Analysis of COVID-19 case numbers in Germany at county level. Panel A presents the time series of daily new infections throughout early 2022, showing both raw daily counts (blue) and the 7-day moving average (orange) that smooths the weekly reporting pattern. Panel B plots the geographical distribution of the COVID-19 incidence per 100,000 population at four weekly intervals in March 2022, illustrating the spatial heterogeneity of infection patterns and their temporal evolution. Panels C and D show the same type of data for late 2020.
  • Figure 5: Comparison of prediction performance across different privacy levels for March 2022 data. Panels A-E show scatter plots of predicted against true case counts for various privacy budgets: (A) $\varepsilon = 0.3$, (B) $\varepsilon = 0.5$, (C) $\varepsilon = 1.0$, (D) $\varepsilon = 2.0$, and (E) $\varepsilon = \infty$ (non-DP). The black diagonal line represents perfect prediction. Panel F displays box plots of Mean Absolute Percentage Error (MAPE) for individual county predictions across all runs and privacy levels, illustrating the decreasing error and variance as privacy budget increases.
  • ...and 4 more figures
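
The two evaluation metrics reported for these comparisons, MAPE and $R^2$, can be computed as in the sketch below. It is a hedged illustration only: the function names are hypothetical, and skipping entries with a true count of zero in the MAPE is an assumption made here so that the percentage error stays defined; how the paper treats zero counts is not stated in this overview.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent.

    Entries with a true value of zero are skipped (an assumption of this
    sketch), because the relative error is undefined there.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return 100.0 * np.mean(np.abs((y_pred[mask] - y_true[mask]) / y_true[mask]))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy usage with made-up counts; the third entry is ignored by mape().
true_counts = [100, 200, 0]
predicted = [110, 180, 5]
error_percent = mape(true_counts, predicted)
fit = r2_score(true_counts, predicted)
```

Because the MAPE weights every county equally regardless of size, small counties with low counts can dominate it, whereas $R^2$ on raw counts is driven mainly by the large counties.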

Theorems & Definitions (9)

  • Definition 1: Differential Privacy (Dwork and Roth, 2014)
  • Definition 2
  • Definition 3: Rényi divergence
  • Definition 4: Rényi differential privacy (RDP)
  • Definition 5: Sampled Gaussian Mechanism (SGM)
  • Theorem 1
  • Theorem 2: Composability (Mironov, 2017)
  • Theorem 3: Conversion from RDP to DP (Balle et al., 2020)
  • Theorem 4: Privacy of our approach
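
For orientation, the standard textbook forms of the notions listed above are written out below. These are the commonly used statements and are given here as an assumption about what the paper's definitions contain; the paper's exact wording, notation, and adjacency relation may differ.

```latex
% (epsilon, delta)-differential privacy of a randomized mechanism M:
% for all adjacent datasets D, D' and all measurable output sets S
\Pr[\mathcal{M}(D) \in S] \;\leq\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta

% Renyi divergence of order alpha > 1 between distributions P and Q
D_{\alpha}(P \,\|\, Q) \;=\; \frac{1}{\alpha - 1}
  \log \mathbb{E}_{x \sim Q}\left[ \left( \frac{P(x)}{Q(x)} \right)^{\alpha} \right]

% (alpha, rho)-Renyi differential privacy: for all adjacent D, D'
D_{\alpha}\left( \mathcal{M}(D) \,\|\, \mathcal{M}(D') \right) \;\leq\; \rho

% Composability: running an (alpha, rho_1)-RDP mechanism followed by an
% (alpha, rho_2)-RDP mechanism yields (alpha, rho_1 + rho_2)-RDP
\rho_{\mathrm{total}} \;=\; \rho_1 + \rho_2
```

With client-level DP, two datasets are typically considered adjacent when they differ in the entire contribution of one client, here one county, community or LHA.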