Positive-Unlabelled Learning for identifying new candidate Dietary Restriction-related genes among Ageing-related genes

Jorge Paz-Ruza; Alex A. Freitas; Amparo Alonso-Betanzos; Bertha Guijarro-Berdiñas

TL;DR

The paper addresses the identification of new dietary restriction (DR)-related genes among ageing-related genes by reframing the task as positive-unlabelled (PU) learning. It introduces a two-step, similarity-based PU method that first extracts reliable negatives from the unlabelled genes using a KNN-like approach with Jaccard similarity, then trains a classifier on the known positives and these reliable negatives. Across the PathDIP and GO feature sets and the CatBoost and BRF classifiers, the PU approach consistently outperforms the prior non-PU method (p<0.05) on F1, G-Mean, and AUC-ROC, while reducing computational cost by up to about 40%. The approach yields a more trustworthy ranking of candidate DR-related genes, including four novel candidates (PRKAB1, PRKAB2, IRS2, PRKAG1) whose potential DR-related role is supported by existing literature, underscoring the method's practical value for guiding wet-lab validation.

Abstract

Dietary Restriction (DR) is one of the most popular anti-ageing interventions; recently, Machine Learning (ML) has been explored to identify potential DR-related genes among ageing-related genes, aiming to minimize costly wet lab experiments needed to expand our knowledge on DR. However, to train a model from positive (DR-related) and negative (non-DR-related) examples, the existing ML approach naively labels genes without known DR relation as negative examples, assuming that lack of DR-related annotation for a gene represents evidence of absence of DR-relatedness, rather than absence of evidence. This hinders the reliability of the negative examples (non-DR-related genes) and the method's ability to identify novel DR-related genes. This work introduces a novel gene prioritisation method based on the two-step Positive-Unlabelled (PU) Learning paradigm: using a similarity-based, KNN-inspired approach, our method first selects reliable negative examples among the genes without known DR associations. Then, these reliable negatives and all known positives are used to train a classifier that effectively differentiates DR-related and non-DR-related genes, which is finally employed to generate a more reliable ranking of promising genes for novel DR-relatedness. Our method significantly outperforms (p<0.05) the existing state-of-the-art approach in three predictive accuracy metrics with up to 40% lower computational cost in the best case, and we identify 4 new promising DR-related genes (PRKAB1, PRKAB2, IRS2, PRKAG1), all with evidence from the existing literature supporting their potential DR-related role.
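
The TL;DR names F1, G-Mean and AUC-ROC as the three predictive accuracy metrics. As a minimal sketch, the first two can be computed from confusion-matrix counts using their standard definitions (G-Mean is the geometric mean of sensitivity and specificity). The function name `f1_and_gmean` and the example counts are illustrative assumptions, not taken from the paper, and how the paper evaluates these metrics in the PU setting is not shown here.

```python
import math
from typing import Tuple


def f1_and_gmean(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Standard F1 and G-Mean from confusion-matrix counts (hypothetical helper)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # G-Mean is the geometric mean of sensitivity and specificity.
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


# Made-up counts, for illustration only.
f1, gmean = f1_and_gmean(tp=8, fp=2, tn=18, fn=2)
print(f"F1={f1:.3f}  G-Mean={gmean:.3f}")
```

G-Mean penalises a classifier that does well on only one class, which matters when the positive (DR-related) class is much smaller than the rest.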

Paper Structure (17 sections, 8 equations, 5 figures, 5 tables, 2 algorithms)

Introduction
Background
Task Formulation
Existing Methodology for DR-Related Gene Identification
Essential Notions of PU Learning
The Proposed PU Learning Method
Experimental Setup
Features and Classifiers
Evaluation Details
Implementation Details
Results
Results of Computational Experiments
Analysis of Most Important Predictive Features
Analysis of the Most Promising Candidate DR-related Genes
Conclusions
...and 2 more sections

Figures (5)

Figure 1: General overview of the two-step modelling to solve the task for proposing potential novel DR-related genes among ageing-related genes.
Figure 2: High-level structure of a two-step PU Learning technique.
Figure 3: Similarity-based reliable negative selection of the proposed PU Learning algorithm. The threshold $t$ is the minimum proportion of unlabelled examples among the $k$ nearest neighbours of an unlabelled example required to consider it a reliable negative ($k$ and $t$ are tunable hyperparameters; in this example, $k=5$ and $t=0.8$). Two different cases are shown: in case (a) the two conditions for a reliable negative are met, i.e. the gene's nearest neighbour and >80% of its $k$ nearest neighbours are unlabelled; the gene is confidently not related to DR and is added to the set of reliable negatives. In case (b), the latter condition is not met; since the gene is not dissimilar enough to known DR-related genes, the gene is not added as a reliable negative, avoiding potential label noise during the training of the classifier in the second step of the PU Learning.
Figure 4: Comparison of the computational cost (measured in grams of carbon dioxide equivalent (gCO$_2$e), lower is better) of Magdaleno et al.'s original non-PU approach and our proposed PU Learning-based approach for identification of new DR-related genes. Results are averaged over 10 complete executions of the nested cross-validation, involving training and inference procedures.
Figure B.1: Details of the F1 Score across 10 executions of the nested cross-validation for the existing method of Magdaleno et al. [vega2022machine] and our PU Learning-based proposal. For the best method, highlighted in blue (PU Learning on the {PathDIP, CAT} scenario), the $p$-value of the paired t-test against the best-performing scenario of the non-PU method is shown.
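
The reliable negative selection described in the caption of Figure 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. Each gene is represented as a set of binary features, similarity is Jaccard (as named in the TL;DR), and the threshold is read as "at least $t$" (the caption's example says ">80%", so the exact comparison is an assumption). The function names, gene identifiers and feature labels are hypothetical.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two sets of binary features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_reliable_negatives(
    features: Dict[str, Set[str]],
    positives: Set[str],
    k: int = 5,
    t: float = 0.8,
) -> List[str]:
    """Return unlabelled genes that meet both conditions from Figure 3."""
    reliable = []
    for gene in features:
        if gene in positives:
            continue  # only unlabelled genes are candidates
        # Rank all other genes by decreasing similarity to this gene.
        others = [g for g in features if g != gene]
        others.sort(key=lambda g: jaccard(features[gene], features[g]),
                    reverse=True)
        neighbours = others[:k]
        if not neighbours:
            continue
        # Condition 1: the single nearest neighbour is unlabelled.
        nearest_unlabelled = neighbours[0] not in positives
        # Condition 2: enough of the k nearest neighbours are unlabelled.
        share = sum(n not in positives for n in neighbours) / len(neighbours)
        # Assumption: "minimum proportion" is interpreted as share >= t.
        if nearest_unlabelled and share >= t:
            reliable.append(gene)
    return reliable


# Toy data (hypothetical genes and features, for illustration only).
toy_features = {
    "P1": {"a", "b", "c"}, "P2": {"a", "b", "d"},       # known DR-related
    "U1": {"a", "b", "e"}, "U6": {"a", "b", "e", "f"},  # close to positives
    "U2": {"x", "y", "z"}, "U3": {"x", "y", "w"},
    "U4": {"x", "z", "w"}, "U5": {"y", "z", "w"},
}
reliable = select_reliable_negatives(toy_features, {"P1", "P2"}, k=3, t=0.8)
print(reliable)
```

In this toy data, U1 and U6 correspond to case (b) of the figure: their nearest neighbour is unlabelled, but two of their three nearest neighbours are known positives, so they are left out. The reliable negatives would then be combined with the known positives to train the second-step classifier.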
