Identifying Critical Phases for Disease Onset with Sparse Haematological Biomarkers

Andrea Zerio, Maya Bechler-Speicher, Tine Jess, Aleksejs Sazonovs

TL;DR

The paper tackles the challenge of detecting disease onset from irregularly sampled haematological biomarkers without resorting to imputation. It applies Graph Neural Additive Networks (GNAN) to time-weighted directed graphs: each biomarker trajectory is a graph $G_k=(V,E)$ whose edge weights $w_t = \rho(\Delta_t)$, with $\Delta_t = t_{t+1}-t_t$ the gap between consecutive sampling events, preserve temporal structure. Predictions are interpretable because contributions decompose at the node (time-point) and feature (biomarker) level. Node representations are defined as $[\mathbf{h}_i]_k = \sum_{j \in V} \rho\left(\frac{1}{0.1 + \Delta t_{ji}}\right) f_k(\mathbf{x}_j^{(k)})$, and node features are augmented with a one-hot biomarker encoding. Empirically, the authors examine 2,500 Crohn's disease (CD) patients and 2,500 controls using haemoglobin, albumin, and C-reactive protein. They observe early discriminative signals between patients and controls and interpretable node-level importance on synthetic data, but performance is not yet clinically adequate. Future work includes recurrent architectures and expanded biomarker/omics coverage.
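To make the node-representation formula concrete, here is a minimal sketch that evaluates $[\mathbf{h}_i]_k$ for a single biomarker trajectory. Several things are assumptions rather than details from the paper: $\Delta t_{ji}$ is taken as the unsigned gap $|t_i - t_j|$, and the learned functions $\rho$ and $f_k$ are replaced by hypothetical identity stand-ins so the arithmetic is easy to follow. The function name `node_representation` and the sample values are illustrative only.

```python
def node_representation(times, values, rho, f_k):
    """Evaluate [h_i]_k = sum_j rho(1 / (0.1 + dt_ji)) * f_k(x_j) for every node i."""
    h = []
    for t_i in times:
        total = 0.0
        for t_j, x_j in zip(times, values):
            dt = abs(t_i - t_j)  # assumption: unsigned time gap between nodes j and i
            total += rho(1.0 / (0.1 + dt)) * f_k(x_j)
        h.append(total)
    return h


# Hypothetical stand-ins for the learned functions (identity maps).
rho = lambda z: z
f_k = lambda x: x

# Made-up sampling times and biomarker values for one trajectory.
times = [0.0, 0.9, 1.9]
values = [1.0, 2.0, 3.0]

h = node_representation(times, values, rho, f_k)
print(h)
```

Because the result is a plain sum over nodes, each term `rho(...) * f_k(x_j)` can be read off as the contribution of time point `j` to node `i`, which is the kind of additive decomposition the TL;DR describes. The constant 0.1 in the denominator keeps the weight finite when two samples share a time stamp.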

Abstract

Routinely collected clinical blood tests are an emerging molecular data source for large-scale biomedical research but inherently feature irregular sampling and informative observation. Traditional approaches rely on imputation, which can distort learning signals and bias predictions while lacking biological interpretability. We propose a novel methodology using Graph Neural Additive Networks (GNAN) to model biomarker trajectories as time-weighted directed graphs, where nodes represent sampling events and edges encode the time delta between events. GNAN's additive structure enables the explicit decomposition of feature and temporal contributions, allowing the detection of critical disease-associated time points. Unlike conventional imputation-based approaches, our method preserves the temporal structure of sparse data without introducing artificial biases and provides inherently interpretable predictions by decomposing contributions from each biomarker and time interval. This makes our model clinically applicable, as well as allowing it to discover biologically meaningful disease signatures.
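The graph construction described in the abstract can be sketched as follows, under stated assumptions: each sampling event becomes a node, and a directed edge links each event to the next one in time, carrying the raw time delta. The function name `build_trajectory_graph`, the dictionary layout, and the sample values are hypothetical; the paper additionally passes the delta through a function $\rho$, which is omitted here.

```python
def build_trajectory_graph(samples):
    """Turn irregular (time, value) samples of one biomarker into a directed graph.

    Returns (nodes, edges), where each edge is (source_id, target_id, time_delta).
    No values are imputed: only observed sampling events become nodes.
    """
    ordered = sorted(samples, key=lambda s: s[0])  # order events by time
    nodes = [
        {"id": i, "time": t, "value": v}
        for i, (t, v) in enumerate(ordered)
    ]
    edges = [
        (i, i + 1, ordered[i + 1][0] - ordered[i][0])
        for i in range(len(ordered) - 1)
    ]
    return nodes, edges


# Made-up measurements: (day, value), deliberately out of order and unevenly spaced.
samples = [(200, 7.9), (0, 8.4), (30, 8.1)]
nodes, edges = build_trajectory_graph(samples)
print(edges)
```

The uneven gaps survive as edge attributes instead of being smoothed away by resampling onto a regular grid, which is the property the abstract contrasts with imputation-based approaches.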

Paper Structure

This paper contains 10 sections, 8 equations, 5 figures.

Figures (5)

  • Figure 1: Diagram illustrating node-level interpretability of synthetically generated data. The size of each biomarker node corresponds to the importance that the model assigned to it. Individual-level trajectories cannot be provided due to Danish data protection laws.
  • Figure 2: Comparison of patient and control data distributions after downsampling controls by age. Values below n=5 were excluded due to Danish data protection rules.
  • Figure 3: Batch training loss (BCEWithLogitsLoss) on a log scale. The plots show that the model learns some initial signal before settling into a local minimum early on and stabilising.
  • Figure 4: Test AUC and Accuracy across training. The two lines in each plot correspond to different segments of the same training run, where training was resumed from a checkpoint after reaching the initial stopping point.
  • Figure 5: Comparison of test set confusion matrices. Despite not yet being performant enough to be clinically relevant, the model does seem to learn some initial signal that is discriminative of CD patients and controls.
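
The quantities named in the captions above (binary cross-entropy on logits, accuracy, AUC, and a confusion matrix) can be computed as in the sketch below. This is a plain-Python illustration, not the authors' evaluation code: the function names, the zero-logit decision threshold, and the example logits and labels are all assumptions. The loss uses the standard numerically stable form and should correspond to what PyTorch's `BCEWithLogitsLoss` computes with mean reduction.

```python
import math


def bce_with_logits(logits, labels):
    """Mean binary cross-entropy from raw logits, in a numerically stable form."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def confusion_matrix(logits, labels, threshold=0.0):
    """Return [[TN, FP], [FN, TP]]; a logit above the threshold predicts class 1."""
    m = [[0, 0], [0, 0]]
    for z, y in zip(logits, labels):
        m[y][1 if z > threshold else 0] += 1
    return m


def accuracy(m):
    """Fraction of correct predictions given a 2x2 confusion matrix."""
    return (m[0][0] + m[1][1]) / sum(map(sum, m))


def auc(logits, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [z for z, y in zip(logits, labels) if y == 1]
    neg = [z for z, y in zip(logits, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up logits and labels (1 = patient, 0 = control), for illustration only.
logits = [2.0, -1.0, 0.5, -3.0]
labels = [1, 0, 0, 1]
cm = confusion_matrix(logits, labels)
print(bce_with_logits(logits, labels), cm, accuracy(cm), auc(logits, labels))
```

An AUC of 0.5 corresponds to chance-level ranking, so a model that has learned "some initial signal" would sit somewhat above that value while still falling short of clinical usefulness.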