Positive-Unlabelled Learning for identifying new candidate Dietary Restriction-related genes among Ageing-related genes

Jorge Paz-Ruza, Alex A. Freitas, Amparo Alonso-Betanzos, Bertha Guijarro-Berdiñas

TL;DR

The paper tackles the identification of new dietary restriction (DR)-related genes among ageing-related genes by reframing the task as positive-unlabelled (PU) learning. It introduces a two-step, similarity-based PU method: the first step extracts reliable negatives from the unlabelled genes using a KNN-like approach with Jaccard similarity, and the second step trains a classifier on the known positives and these reliable negatives. Across the PathDIP and GO feature sets and the CatBoost and BRF classifiers, the PU approach consistently outperforms the prior non-PU method (p<0.05) on F1, G-Mean, and AUC-ROC, while reducing computational cost by up to ~40% in the best case. The resulting ranking of candidate DR-related genes is more trustworthy and includes four novel candidates (PRKAB1, PRKAB2, IRS2, PRKAG1) whose potential DR roles are supported by the existing literature, underscoring the method's practical value for guiding wet-lab validation.
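
For readers unfamiliar with two of the metrics named above, the following is a minimal sketch of how F1 and G-Mean can be computed from confusion-matrix counts. It assumes the standard definitions (F1 as the harmonic mean of precision and recall, G-Mean as the geometric mean of sensitivity and specificity); the function name and the counts are hypothetical and are not taken from the paper.

```python
import math


def f1_and_gmean(tp: int, fp: int, tn: int, fn: int) -> tuple:
    """Return (F1, G-Mean) from confusion-matrix counts (standard definitions assumed)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


# Hypothetical counts, for illustration only (not results from the paper).
f1, gmean = f1_and_gmean(tp=8, fp=2, tn=18, fn=2)
print(f"F1={f1:.3f}  G-Mean={gmean:.3f}")
```

G-Mean penalises a classifier that does well on one class only, which is why it is often reported alongside F1 on imbalanced data such as this one, where known positives are scarce.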

Abstract

Dietary Restriction (DR) is one of the most popular anti-ageing interventions; recently, Machine Learning (ML) has been explored to identify potential DR-related genes among ageing-related genes, aiming to minimize costly wet lab experiments needed to expand our knowledge on DR. However, to train a model from positive (DR-related) and negative (non-DR-related) examples, the existing ML approach naively labels genes without known DR relation as negative examples, assuming that lack of DR-related annotation for a gene represents evidence of absence of DR-relatedness, rather than absence of evidence. This hinders the reliability of the negative examples (non-DR-related genes) and the method's ability to identify novel DR-related genes. This work introduces a novel gene prioritisation method based on the two-step Positive-Unlabelled (PU) Learning paradigm: using a similarity-based, KNN-inspired approach, our method first selects reliable negative examples among the genes without known DR associations. Then, these reliable negatives and all known positives are used to train a classifier that effectively differentiates DR-related and non-DR-related genes, which is finally employed to generate a more reliable ranking of promising genes for novel DR-relatedness. Our method significantly outperforms (p<0.05) the existing state-of-the-art approach in three predictive accuracy metrics with up to 40% lower computational cost in the best case, and we identify 4 new promising DR-related genes (PRKAB1, PRKAB2, IRS2, PRKAG1), all with evidence from the existing literature supporting their potential DR-related role.

Paper Structure (17 sections, 8 equations, 5 figures, 5 tables, 2 algorithms)

Figures (5)

  • Figure 1: General overview of the two-step modelling to solve the task for proposing potential novel DR-related genes among ageing-related genes.
  • Figure 2: High-level structure of a two-step PU Learning technique.
  • Figure 3: Similarity-based reliable negative selection of the proposed PU Learning algorithm. The threshold $t$ is the minimum proportion of unlabelled examples among the $k$ nearest neighbours of an unlabelled example required to consider it a reliable negative ($k$ and $t$ are tunable hyperparameters; in this example, $k=5$ and $t=0.8$). Two different cases are shown: in case (a) the two conditions for a reliable negative are met, i.e. the gene's nearest neighbour and >80% of its $k$ nearest neighbours are unlabelled; the gene is confidently not related to DR and is added to the set of reliable negatives. In case (b), the latter condition is not met; since the gene is not dissimilar enough to known DR-related genes, the gene is not added as a reliable negative, avoiding potential label noise during the training of the classifier in the second step of the PU Learning.
  • Figure 4: Comparison of the computational cost (measured in grams of carbon dioxide equivalent (gCO$_2$e), lower is better) of Magdaleno et al.'s original non-PU approach and our proposed PU Learning-based approach for identification of new DR-related genes. Results are averaged over 10 complete executions of the nested cross-validation, involving training and inference procedures.
  • Figure B.1: Details of the F1 Score across 10 executions of the nested cross-validation for the existing method of Magdaleno et al. and our PU Learning-based proposal. For the best method, highlighted in blue (PU Learning on the {PathDIP, CAT} scenario), the $p$-value of the paired t-test against the best-performing scenario of the non-PU method is shown.
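
The reliable-negative selection described in the Figure 3 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: genes are represented as sets of binary features (for example pathway or GO-term memberships), neighbours are drawn from all other genes, the threshold is applied as "proportion >= t" (the caption's worked example uses a strict ">80%"), and every gene name, feature name, and function name below is hypothetical.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_reliable_negatives(
    features: Dict[str, Set[str]],
    positives: Set[str],
    k: int = 5,
    t: float = 0.8,
) -> List[str]:
    """Return unlabelled genes that satisfy both conditions from the caption:
    (1) the nearest neighbour is unlabelled, and
    (2) at least a proportion t of the k nearest neighbours are unlabelled
        (assumption: ">=" is used here).
    """
    reliable = []
    for gene in sorted(features):
        if gene in positives:
            continue  # only unlabelled genes can become reliable negatives
        others = [g for g in sorted(features) if g != gene]
        # Most similar first; ties keep alphabetical order (an assumption).
        others.sort(key=lambda g: -jaccard(features[gene], features[g]))
        neighbours = others[:k]
        if not neighbours:
            continue
        nearest_unlabelled = neighbours[0] not in positives
        share = sum(n not in positives for n in neighbours) / len(neighbours)
        if nearest_unlabelled and share >= t:
            reliable.append(gene)
    return reliable


# Hypothetical toy data: P* are known positives, everything else is unlabelled.
TOY_FEATURES = {
    "P1": {"a", "b", "c"},
    "P2": {"a", "b", "d"},
    "N1": {"a", "b", "c", "d"},       # nearest neighbour is a positive
    "B1": {"a", "b", "q", "r"},       # case (b): too many positives nearby
    "B2": {"a", "b", "q", "r", "s"},  # case (b)
    "C1": {"x", "y", "z"},            # case (a): far from all positives
    "C2": {"x", "y", "w"},
    "C3": {"x", "z", "w"},
    "C4": {"y", "z", "w"},
    "C5": {"x", "y", "z", "w"},
    "C6": {"x", "y"},
}
TOY_POSITIVES = {"P1", "P2"}

reliable = select_reliable_negatives(TOY_FEATURES, TOY_POSITIVES, k=5, t=0.8)
print(reliable)
```

In this toy example the `C*` genes correspond to case (a) and are kept, while `B1` and `B2` correspond to case (b): their nearest neighbour is unlabelled, but two of their five nearest neighbours are known positives, so they are left out of the training set for the second step.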