
Learning relationships in epidemiological data using graph neural networks

Anthony J Wood, Aeron R Sanchez, Rowland R Kao

Abstract

When designing control strategies for an infectious disease, it is critical to identify the key pathways of transmission. Data on infected hosts - when they were born, where they lived and with whom they interacted - can help infer sources of infection and transmission clusters. However, such data are generally not powerful enough to identify infector-infectee pairs with any certainty. Whole-genome sequencing data of the underlying pathogen, on the other hand, can serve as a powerful adjunct to these data, as they can be used to estimate the time to the most recent common ancestor between two infected hosts, and in turn their relative proximity in the transmission tree. A statistical model that explains the genetic distance between different host pathogens and associated risk factors can therefore inform key risk factors for transmission itself. We show how graph neural networks (GNNs) are a powerful and natural modelling architecture for such a problem. By treating the epidemiological dataset as a graph where infected hosts are nodes and edges are weighted by the genetic distance between different host pairs, we show how a GNN can be fit to predict the genetic distance between known hosts and new, unsequenced hosts. Comparisons with other established approaches show that GNNs have useful performance advantages, albeit with greater computational cost.
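The graph construction described above - hosts as nodes with their own attributes, pairs as edges weighted by known genetic distances, and a scalar prediction in [0, 1] for an unsequenced pair - can be sketched in a minimal numpy example. This is an illustrative sketch only, not the authors' architecture: the layer sizes, the `message_pass` and `pair_score` functions, and the untrained random weights are all assumptions made for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

H, F = 6, 4                      # toy sizes: 6 hosts, 4 node attributes each
N = rng.normal(size=(H, F))      # node attributes, one row per host
A = np.zeros((H, H))             # adjacency weighted by known genetic distances
A[0, 1] = A[1, 0] = 1.0          # hosts 0 and 1 closely related
A[2, 3] = A[3, 2] = 1.0          # hosts 2 and 3 closely related

def message_pass(N, A, W):
    """One round of mean-neighbour aggregation followed by a linear map."""
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)
    agg = (A @ N) / deg          # average of each host's neighbour features
    return np.tanh(np.concatenate([N, agg], axis=1) @ W)

def pair_score(h_i, h_j, w):
    """Symmetric scalar prediction in (0, 1) for the host pair (i, j)."""
    z = np.concatenate([h_i * h_j, np.abs(h_i - h_j)]) @ w
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid keeps output between 0 and 1

W = rng.normal(size=(2 * F, F)) * 0.1    # untrained weights, shapes only
w = rng.normal(size=2 * F) * 0.1

Hemb = message_pass(N, A, W)             # embedded representations per host
d_pred = pair_score(Hemb[0], Hemb[4], w) # prediction for an unsequenced pair
```

In practice each module would be a trained multi-layer network and the aggregation would run over the genetic-distance-weighted edges, but the data flow - node attributes and edge structure in, embedded representations, then a bounded pairwise score out - matches the description above.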

Paper Structure

This paper contains 14 sections, 14 equations, 12 figures, and 7 tables.

Figures (12)

  • Figure 1: Graph neural network architecture. This model evaluates the probability that a pair of hosts $i$, $j$ are closely related, when one of the hosts does not have a known pathogen sequence. The input data (light pink) are the node attributes of those individual hosts $\underline{n}_i$, $\underline{n}_j$, the edge attributes of that host pair $\underline{e}_{ij}$, the node attributes of all hosts in the dataset $\mathbf{N}$, and the edge attributes of all hosts in the dataset $\mathbf{E}$ (including known genetic distances). These data feed in through neural network modules (grey), with the intermediate outputs termed embedded representations (dark pink). The final output $d^{\mathrm{pred}}_{ij}$ (blue) is a scalar between 0 and 1.
  • Figure 2: Model performance over the synthetic datasets ($H=2\,000$ hosts each). Left: Classification of test host pairs $(i,j)$ for each model, separated by whether they are truly closely related $(d_{ij} = 1)$ or distant $(d_{ij} = 0)$. Right: mean prediction entropy (MPE, where lower MPE indicates more confident predictions), balanced accuracy (BA) and area under the receiver operating characteristic curve (ROC-AUC).
  • Figure 3: Variable importance for the synthetic models, quantified as the loss in balanced accuracy on the test host pairs when a given variable in the dataset is randomly permuted (effectively removing it). The $\texttt{Genetic\_Distance}$ attribute is only populated for edges in the train dataset for the GNN model. Other variables with a value for the GNN only are node-level attributes.
  • Figure 4: Model performance over the Woodchester dataset ($H=241$ hosts). Left: Classification of test host pairs $(i,j)$ for each model, separated by whether they are truly closely related $(d_{ij} = 1)$ or distant $(d_{ij} = 0)$. Right: mean prediction entropy (MPE, where lower MPE indicates more confident predictions), balanced accuracy (BA) and area under the receiver operating characteristic curve (ROC-AUC).
  • Figure 5: Variable importance for the Woodchester model, quantified as the loss in balanced accuracy on the test host pairs when a given variable in the dataset is randomly permuted (effectively removing it). The $\texttt{Genetic\_Distance}$ attribute is only populated for edges in the train dataset for the GNN model. Other variables with a value for the GNN only are node-level attributes.
  • ...and 7 more figures
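The quantities used throughout the figure captions - balanced accuracy, mean prediction entropy, and permutation-based variable importance - can be sketched in a few lines of numpy. This is a generic illustration, not the authors' evaluation code: the function names are made up, the MPE is assumed to use the natural logarithm, and the `predict` function is a toy stand-in for a fitted classifier.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls; robust when closely related pairs are rare."""
    return float(np.mean([(y_pred[y_true == c] == c).mean() for c in (0, 1)]))

def mean_prediction_entropy(p, eps=1e-12):
    """Average binary entropy of predicted probabilities (natural log assumed);
    lower values indicate more confident predictions."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))))

def permutation_importance(predict, X, y, col, rng):
    """Loss in balanced accuracy after randomly permuting one variable."""
    base = balanced_accuracy(y, predict(X))
    Xp = X.copy()
    Xp[:, col] = X[rng.permutation(len(X)), col]   # shuffle that column only
    return base - balanced_accuracy(y, predict(Xp))

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)                    # only column 0 is informative
predict = lambda Z: (Z[:, 0] > 0).astype(int)    # stand-in for a fitted model

imp_informative = permutation_importance(predict, X, y, 0, rng)
imp_noise = permutation_importance(predict, X, y, 2, rng)
```

Permuting the informative column costs the toy model roughly half its balanced accuracy, while permuting an unused column costs nothing - the same logic by which Figures 3 and 5 rank the epidemiological variables.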