Bayesian Inference for Epidemic Final Size Datasets with Hidden Underlying Household Structure

Joseph Brooks, Thomas House, Lorenzo Pellis, Joe Hilton

Abstract

Households represent a key unit of interest in infectious disease epidemiology, in both empirical studies and mathematical modelling. The within-household transmission potential of a disease is often summarised by a secondary attack ratio (SAR). Despite its widespread use, the SAR depends on the household size distribution (HHSD) seen during the study period, making it difficult to generalise to new contexts. Extending estimates of transmission potential to new populations instead requires estimates of person-to-person transmission rates, which can be combined with data on population structure to parametrise mechanistic transmission models. In this study we present a new Bayesian inference method which uses an MCMC algorithm to infer the transmission intensity by imputing the unreported household structure underlying the epidemic. This method can be run on household epidemiological data reported at varying levels of resolution. For synthetic data from a realistic underlying HHSD, we consistently achieved over 95% coverage in our estimates of the transmission rate. We also consistently achieved over 95% coverage for data generated with a pathological underlying HHSD, given strong information about the HHSD. Using an existing dataset which recorded micro-scale household epidemiological outcomes during the COVID-19 pandemic, we show that stratifying observed SARs by household size substantially reduces the uncertainty in estimates. Our findings suggest that researchers conducting household epidemiological studies can improve the utility of their results for infectious disease modellers by reporting household-stratified estimates. These results aim to encourage the reporting of higher-resolution outputs in epidemiological field work because, in the absence of strong priors, transmission parameters were not easily identifiable from the low-resolution datasets that are often reported.
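
The dependence of the SAR on the HHSD can be illustrated with a small numerical sketch. It assumes that each household has exactly one primary case and that the SAR within a household depends only on its size. The function name `pooled_sar`, the size-specific SAR values, and the two household size distributions are hypothetical and are not taken from this paper.

```python
def pooled_sar(size_counts, sar_by_size):
    """Pooled SAR = total secondary cases / total household contacts.

    size_counts: {household size: number of households}, each household
                 assumed to contain exactly one primary case.
    sar_by_size: {household size: hypothetical size-specific SAR}.
    """
    contacts = 0.0
    secondary = 0.0
    for size, count in size_counts.items():
        n_contacts = count * (size - 1)  # everyone except the primary case
        contacts += n_contacts
        secondary += n_contacts * sar_by_size[size]
    return secondary / contacts


# Hypothetical size-specific SARs (illustrative only, not from the paper).
sar_by_size = {2: 0.40, 3: 0.30, 4: 0.25}

# Two hypothetical study populations with different household size distributions.
small_heavy = {2: 700, 3: 200, 4: 100}
large_heavy = {2: 100, 3: 200, 4: 700}

sar_small = pooled_sar(small_heavy, sar_by_size)
sar_large = pooled_sar(large_heavy, sar_by_size)
print(f"pooled SAR, mostly small households: {sar_small:.3f}")
print(f"pooled SAR, mostly large households: {sar_large:.3f}")
```

Under these assumptions the same size-specific SARs give different pooled SARs in the two populations, which is why a single pooled SAR is hard to carry over to a population with a different HHSD.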

Paper Structure

This paper contains 4 sections, 11 equations, 11 figures, 1 table, 2 algorithms.

Figures (11)

  • Figure 1: Visual representation of the low information proposal algorithm (Algorithm \ref{alg:low info proposal}). Three hypothetical proposal steps are shown for a low information dataset with $(N,y,n) = (24,25,45)$ and maximum number of household contacts $m=3$. On the left-hand side the indices of the outcomes are shown with pictorial representations of these outcomes (primary cases, secondary cases and non-cases shown by red Ps, red positive signs and blue negative signs, respectively). Each proposal move is represented by a black arrow starting at an entry in the previous pseudo-dataset and ending at an entry in the new pseudo-dataset, with their respective indices being $k_1$ and $k_2$ in the notation of Algorithm \ref{alg:low info proposal}. The infectious status of the contact being moved is shown by the symbol on the arrow. Data that have changed in a step are shown as white digits with a black outline.
  • Figure 2: In each subplot, 95% confidence intervals for the posteriors of $\beta$ are shown for 100 low information synthetic datasets per household size distribution. Each dataset is generated using the Ball model ($I\sim \text{Gamma}(2,2)$) and 1000 household sizes sampled either from the UK LFS (2023) or the "split" household size distribution (see Appendix \ref{appendix: hh size dist} for details), and $\beta$ is re-estimated separately using $\alpha_0 = 100$ and $1000$. Each subplot row shows results for synthetic data generated with a different base transmission rate ($\beta$) and each column shows results for a different density mixing parameter ($\eta$). The value of $\beta$ used to generate the data is shown by the vertical dotted line in each plot, and confidence intervals are plotted in green if they contain this value and in red otherwise. The percentage of synthetic datasets for which the true value is contained within the 95% confidence interval is shown above each set of confidence intervals. All fits were done with $I\sim \text{Gamma}(2,2)$ for a known $\eta$, so only $\beta$ was inferred.
  • Figure 3: Panels A, B, and C show the posterior distributions of $\theta$ obtained by fitting the low- (red), medium- (green), and high-information (blue) versions of the data from [carazo_characterization_2021], respectively. Panel D displays the secondary attack ratio (SAR) implied by each posterior alongside the observed SAR (black) for each household size and overall; error bars represent 95% confidence intervals, with those for the observed SAR estimated by bootstrapping. Bars indicate the number of households of each size in the low-information fit (red) compared with the observed distribution (black). The infectious period is assumed to be $I \sim \text{Gamma}(2,2)$.
  • Figure 4: Base transmission rates ($\beta$) for SARS-CoV-2 estimated from low-information datasets from studies included in Madewell et al. [madewell_household_2020, madewell_factors_2021, madewell_household_2022], colour-coded and grouped by strain where it was specified. A table on the right-hand side lists the SAR, average household size and number of households in each of these studies.
  • Figure S1: Size-weighted household size distributions that were used to generate synthetic data. The distribution for the UK LFS (2023) [ons_families_2024] and the "split" distribution are shown in red and blue, respectively. Both distributions share a mean final size, after size weighting, of 3.32, which is shown by the vertical black dotted line. For the purpose of this paper, households reported to have 6+ individuals in the UK LFS are assumed to have size exactly 6.
  • ...and 6 more figures
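
The coverage percentages described in the Figure 2 caption are the share of synthetic datasets whose 95% interval contains the value of $\beta$ used to generate the data. A minimal sketch of that calculation follows; the function name `coverage` and the interval endpoints are hypothetical and are not results from the paper.

```python
def coverage(intervals, true_value):
    """Percentage of (lower, upper) intervals that contain true_value."""
    hits = sum(1 for lower, upper in intervals if lower <= true_value <= upper)
    return 100.0 * hits / len(intervals)


# Hypothetical 95% intervals for beta from four synthetic datasets
# (illustrative numbers only).
intervals = [(0.8, 1.2), (0.9, 1.4), (1.05, 1.3), (0.7, 1.0)]

pct = coverage(intervals, true_value=1.0)
print(f"coverage: {pct:.0f}%")
```

Here three of the four hypothetical intervals contain the true value; endpoints are treated as inclusive, which is a choice made for this sketch.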