Lifestyle-Informed Personalized Blood Biomarker Prediction via Novel Representation Learning

A. Ali Heydari; Naghmeh Rezaei; Javier L. Prieto; Shwetak N. Patel; Ahmed A. Metwally

TL;DR

This paper tackles the challenge of personalizing blood biomarker references by incorporating lifestyle factors (physical activity and sleep) into a representation-learning framework. It introduces a novel deep metric learning approach with a regularized triplet loss to produce compact, clinically meaningful embeddings, which are then combined with current biomarker values to predict future biomarker levels from a single visit. Across the UK Biobank, the authors show that lifestyle differences meaningfully affect biomarker distributions and that their embeddings outperform traditional representations in downstream tasks, boosting future-value prediction accuracy, especially for metabolic biomarkers. The work points toward practical clinical benefits in early disease detection and tailored preventive care, while acknowledging limitations in dataset diversity and follow-up density and outlining plans to validate in more diverse populations.

Abstract

Blood biomarkers are an essential tool for healthcare providers to diagnose, monitor, and treat a wide range of medical conditions. Current reference values and recommended ranges often rely on population-level statistics, which may not adequately account for inter-individual variability driven by factors such as lifestyle and genetics. In this work, we introduce a novel framework for predicting future blood biomarker values and defining personalized references through representations learned from lifestyle data (physical activity and sleep) and blood biomarkers. Our proposed method learns a similarity-based embedding space that captures the complex relationship between biomarkers and lifestyle factors. Using the UK Biobank (257K participants), we show that our deep-learned embeddings outperform traditional and current state-of-the-art representation learning techniques in predicting clinical diagnosis. Using a subset of 6440 UK Biobank participants who have follow-up visits, we validate that including these embeddings and lifestyle factors directly in blood biomarker models improves the prediction of future lab values from a single lab visit. This personalized modeling approach provides a foundation for developing more accurate risk stratification tools and tailoring preventative care strategies. In clinical settings, this translates to the potential for earlier disease detection, more timely interventions, and ultimately, a shift towards personalized healthcare.
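The two steps described above (learn a similarity-based embedding, then combine it with the current biomarker value) can be illustrated with a minimal sketch. This is not the authors' implementation: the loss below is the standard triplet margin loss, and the squared-norm penalty is only an assumed stand-in for the paper's regularizer, whose exact form is not given in this summary. The function names `triplet_loss` and `build_features`, and the parameters `margin` and `reg_weight`, are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0, reg_weight=0.0):
    """Standard triplet margin loss plus an optional squared-norm penalty.

    Assumption: the penalty is a generic placeholder regularizer, not the
    regularization term actually used in the paper.
    """
    # Pull the positive closer to the anchor than the negative by `margin`.
    hinge = max(
        0.0,
        euclidean(anchor, positive) - euclidean(anchor, negative) + margin,
    )
    # Penalize large embedding norms (hypothetical regularizer).
    penalty = sum(x * x for vec in (anchor, positive, negative) for x in vec)
    return hinge + reg_weight * penalty


def build_features(embedding, current_value):
    """Concatenate a learned embedding with the current biomarker value,
    giving the input to a biomarker-specific future-value model."""
    return list(embedding) + [current_value]
```

In this sketch, a triplet whose positive is already much closer to the anchor than its negative contributes zero loss, so training effort concentrates on participants whose embeddings are still poorly separated. Any regressor could then be fit on the output of `build_features`, with the next-visit biomarker value as the target.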

Paper Structure (10 sections, 8 equations, 3 figures)

Introduction
Methods
Novel Deep Metric Learning Approach for Learning Patient Similarity
Personalized Blood Biomarker Models
Data and Data Pre-Processing
Results
Association of Biomarker Values and Lifestyle
Deep Representation Learning Improves Downstream Tasks
Personalized Blood Biomarker Model for Future Value Prediction
Conclusions and Discussion

Figures (3)

Figure 1: Overview of our proposed methodology and data. (a) Our approach for predicting future blood biomarker values from a single lab visit consists of two steps: First, we learn a similarity-based representation of blood biomarkers and lifestyle factors using our novel metric learning technique. Second, using the learned representations in combination with the current value of the biomarker of interest, we train biomarker-specific models for predicting the future biomarker values, which can be used as a personal reference. (b) To showcase our approach on a broad population, we use the United Kingdom Biobank (UKB) [28-UKB]. For representation learning and modeling, we leverage the first assessment (visit), and for assessing the accuracy of future predictions, we utilize the next visit as the prospective validation of our personalized blood biomarker models. (c) We present the data summary in Table I, and provide the complete list of used features. Data statistics are presented as number of instances or percentages for counts, or as mean $\pm$ standard deviation for continuous values.
Figure 2: Differences in biomarker values based on sex, age, and activity levels. (a) Distribution (percentiles) of selected lab trends per sex. The $x$-axis represents age for females (left, shades of orange) and males (right, shades of blue), with the median value highlighted as a black line. Clinical recommended ranges are marked for reference (upper and lower bounds represented by dashed purple and green lines, respectively). (b) Result of performing statistical analysis between active and less active individuals (among the currently-healthy group) on a subset of biomarkers. Our results show that many blood biomarker distributions are statistically significantly different based on activity levels, in both males and females and among various age groups. Abbreviations: a.u. stands for arbitrary units after population-wide z-score normalization.
Figure 3: Qualitative and quantitative results of our proposed metric learning on UKB. (a) UMAP visualization of the untransformed space (left column) compared to the UMAP visualization of the learned embedding space through our proposed model (right column), both for female (top) and male (bottom) participants in the UKB. (b) Comparison of commonly-used representations for EHR, namely PCA, Diffusion Maps (DiffMap), DeepPatient, and our proposed representation learning. To show the effect of the representations as opposed to the classification schemes, we use four different classifiers (K-Nearest Neighbors [KNNs], Linear Discriminant Analysis [LDA], Neural Network [NN] for EHR [Chen2020Interpretable], and Extreme Gradient Boosting Ensemble [XGBoost]). Boldface values indicate the highest accuracy in terms of weighted F1 scores. (c) Comparison of our proposed metric learning objective with commonly-used metric and contrastive learning objectives, namely InfoNCE [infonce], N-Pairs [Sohn-npair], Multi-Level Distance Regularization (MDR) [Kim2021-MDR], LiftedStruct [song2015-liftedstruct], and Distance-Swap Triplet Loss [distance-swap-Balntas2016].
