Improving Diseases Predictions Utilizing External Bio-Banks

Hido Pinto, Eran Segal

TL;DR

This work tackles the scarcity of disease labels in metabolomics by training LightGBM models on a metabolomics-rich, smaller dataset (10K) to impute missing metabolomics features in a large biobank (UKBB). The imputed features are then incorporated into survival analyses and GWAS to uncover biologically meaningful associations and validate signals that were absent from the training data. Notably, a genetic link between vascular dementia and smoking emerges from GWAS, and survival-model reintegration identifies obesity-related metabolic signals, demonstrating the value of cross-dataset integration. While imputation reveals important biological insights, direct improvements in disease outcome prediction are variable, underscoring the need for more robust multi-modal imputation and causal inference approaches for practical impact.

Abstract

Machine learning has been successfully used in critical domains, such as medicine. However, extracting meaningful insights from biomedical data is often constrained by the lack of available disease labels. In this research, we demonstrate how machine learning can be leveraged to enhance explainability and uncover biologically meaningful associations, even when predictive improvements in disease modeling are limited. We train LightGBM models from scratch on our dataset (10K) to impute metabolomics features and apply them to the UK Biobank (UKBB) for downstream analysis. The imputed metabolomics features are then used in survival analysis to assess their impact on disease-related risk factors. As a result, our approach identified biologically relevant connections that were not previously known to the predictive models. Additionally, we applied a genome-wide association study (GWAS) to key metabolomics features, revealing a link between vascular dementia and smoking. Although this is a well-established epidemiological relationship, it was not embedded in the model's training data, which validates the method's ability to extract meaningful signals. Furthermore, by integrating survival models as inputs in the 10K data, we uncovered associations between metabolic substances and obesity, demonstrating the ability to infer disease risk for future patients without requiring direct outcome labels. These findings highlight the potential of leveraging external bio-banks to extract valuable biomedical insights, even in data-limited scenarios. Our results demonstrate that machine learning models trained on smaller datasets can still be used to uncover real biological associations when carefully integrated with survival analysis and genetic studies.
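
The cross-cohort imputation step described in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation: a one-variable least-squares fit stands in for the LightGBM models, both cohorts are synthetic, and every name (`make_cohort`, `shared`, `metabolite_imputed`, the cohort sizes) is a hypothetical choice made for illustration. It shows only the data flow: fit on the smaller cohort where the metabolite is measured, check $R^2$ on a held-out validation split, then impute the metabolite in the larger cohort, where only the shared feature is available.

```python
import random
import statistics


def fit_line(x, y):
    """Least-squares slope and intercept (a stand-in for a LightGBM regressor)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def make_cohort(n, rng, with_metabolite):
    """Synthetic cohort with one shared feature and, optionally, a measured metabolite."""
    rows = []
    for _ in range(n):
        shared = rng.gauss(27.0, 4.0)  # hypothetical shared feature (assumption)
        metabolite = (0.8 * shared + rng.gauss(0.0, 1.0)) if with_metabolite else None
        rows.append({"shared": shared, "metabolite": metabolite})
    return rows


rng = random.Random(0)
small = make_cohort(1000, rng, with_metabolite=True)   # plays the role of 10K
large = make_cohort(5000, rng, with_metabolite=False)  # plays the role of UKBB

# Fit on the small cohort, holding out a validation split.
train, val = small[:800], small[800:]
slope, intercept = fit_line(
    [r["shared"] for r in train], [r["metabolite"] for r in train]
)
val_pred = [slope * r["shared"] + intercept for r in val]
val_r2 = r2_score([r["metabolite"] for r in val], val_pred)

# Impute the unmeasured metabolite in the large cohort from the shared feature.
for r in large:
    r["metabolite_imputed"] = slope * r["shared"] + intercept

print(f"validation R^2 on the small cohort: {val_r2:.3f}")
```

In the pipeline the abstract describes, the imputed columns would then feed the survival analysis and GWAS; those steps are omitted from this sketch.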

Paper Structure

This paper contains 18 sections, 3 equations, 10 figures.

Figures (10)

  • Figure 1: Proposed Pipeline
  • Figure 2: Training results for models on 10K (validation set). $R^2$ values for each metabolite, stratified by gender, obtained using the proposed pipeline with both MSE and $R^2$-score loss functions.
  • Figure 3: Testing significance for metabolomics column predictions on 10K (validation set). Each blue column represents an $R^2$ value of the pipeline with shuffled labels. The green column on the right is the $R^2$ value given by the pipeline with the true labels.
  • Figure 4: Prediction results for omitted columns in the test dataset (UKBB). Both figures show the $R^2$ values given by running the main pipeline (shared features to metabolomics) when omitting all of the columns on the y-axis and using them as labels. The columns show the results on the different data splits, where train and validation are taken from 10K, and test is taken solely from UKBB. The figures are separated by sex.
  • Figure 5: Predicting unseen modalities from the imputed metabolomics. The shared-features sub-figures (10K and UKBB) show the external-modality prediction results from the pipeline trained on the shared features, and the NMR sub-figures (10K and UKBB) show the external-modality prediction results from the pipeline trained on the NMR features.
  • ...and 5 more figures
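
The shuffled-label significance check summarized in the Figure 3 caption can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the authors' code: a simple least-squares line stands in for the LightGBM pipeline, the data are synthetic, the labels are shuffled on the training split only (the caption does not say which split is shuffled), and the names `pipeline_r2` and `shuffled_label_test` are hypothetical. The idea is that refitting on shuffled labels yields a null distribution of $R^2$ values, against which the $R^2$ obtained with the true labels is compared.

```python
import random
import statistics


def fit_line(x, y):
    """Least-squares slope and intercept (stand-in for the real model)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pipeline_r2(x_train, y_train, x_val, y_val):
    """Fit on the training split and score R^2 on the validation split."""
    slope, intercept = fit_line(x_train, y_train)
    return r2_score(y_val, [slope * a + intercept for a in x_val])


def shuffled_label_test(x_train, y_train, x_val, y_val, n_perm=200, seed=0):
    """Compare the true-label R^2 with R^2 values from shuffled training labels."""
    rng = random.Random(seed)
    true_r2 = pipeline_r2(x_train, y_train, x_val, y_val)
    null_r2 = []
    for _ in range(n_perm):
        shuffled = list(y_train)
        rng.shuffle(shuffled)  # break the feature-label relationship
        null_r2.append(pipeline_r2(x_train, shuffled, x_val, y_val))
    # Permutation p-value with the usual +1 correction.
    exceed = sum(r >= true_r2 for r in null_r2)
    p_value = (1 + exceed) / (1 + n_perm)
    return true_r2, null_r2, p_value


# Synthetic data with a real signal (assumption, for illustration only).
data_rng = random.Random(1)
x = [data_rng.gauss(0.0, 1.0) for _ in range(500)]
y = [2.0 * a + data_rng.gauss(0.0, 1.0) for a in x]

true_r2, null_r2, p_value = shuffled_label_test(x[:400], y[:400], x[400:], y[400:])
print(f"true R^2 = {true_r2:.3f}, max shuffled R^2 = {max(null_r2):.3f}, p = {p_value:.4f}")
```

With a genuine signal, the true-label $R^2$ should sit well to the right of every shuffled-label value, which is the pattern the caption describes for the green column relative to the blue ones.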