Robust Gene Prioritization via Fast-mRMR Feature Selection in high-dimensional omics data

Rubén Fernández-Farelo, Jorge Paz-Ruza, Bertha Guijarro-Berdiñas, Amparo Alonso-Betanzos, Alex A. Freitas

TL;DR

This work tackles gene prioritization in high-dimensional omics data under positive-unlabeled labelling by introducing a pipeline that leverages Fast-mRMR feature selection to yield a compact, non-redundant feature set before classification. By integrating GO annotations and PathDIP pathways and training CatBoost and Balanced Random Forest on the reduced data, the approach achieves superior predictive performance and reveals the importance of feature selection in enabling synergy between disparate biological sources. On Dietary Restriction datasets, the GO+PathDIP combination with Fast-mRMR delivers the best results (e.g., AUC-ROC up to 0.872 and AUC-PR up to 0.607) with statistical significance, while also reducing training cost and improving scalability. The work highlights the value of dimensionality reduction for reliable gene prioritization and outlines future directions to broaden data sources, couple with PU Learning, and enhance interpretability of the selected features and top-ranked genes.

Abstract

Gene prioritization (identifying genes potentially associated with a biological process) is increasingly tackled with Artificial Intelligence. However, existing methods struggle with the high dimensionality and incomplete labelling of biomedical data. This work proposes a more robust and efficient pipeline that leverages Fast-mRMR feature selection to retain only relevant, non-redundant features for classifiers. This enables us to build simpler and more effective models, as well as to combine different biological feature sets. Experiments on Dietary Restriction datasets show significant improvements over existing methods, proving that feature selection can be critical for reliable gene prioritization.
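To make the "relevant, non-redundant" criterion concrete, the following is a minimal sketch of plain greedy mRMR selection on discrete features. It is not the Fast-mRMR implementation used in the paper, which is an accelerated variant of the same idea. The function names (`mutual_information`, `mrmr_select`), the difference-based scoring (relevance minus mean redundancy), and the toy data are all illustrative assumptions.

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete 1-D arrays."""
    _, x_idx = np.unique(x, return_inverse=True)
    _, y_idx = np.unique(y, return_inverse=True)
    x_idx, y_idx = x_idx.ravel(), y_idx.ravel()
    # Empirical joint distribution from co-occurrence counts.
    joint = np.zeros((x_idx.max() + 1, y_idx.max() + 1))
    np.add.at(joint, (x_idx, y_idx), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

def mrmr_select(X, y, k):
    """Greedy mRMR: pick k features maximising relevance minus mean redundancy."""
    n_features = X.shape[1]
    # Relevance: mutual information between each feature and the label.
    relevance = np.array(
        [mutual_information(X[:, j], y) for j in range(n_features)]
    )
    selected = [int(np.argmax(relevance))]
    redundancy_sum = np.zeros(n_features)
    while len(selected) < min(k, n_features):
        last = selected[-1]
        # Accumulate redundancy against the most recently selected feature.
        for j in range(n_features):
            redundancy_sum[j] += mutual_information(X[:, j], X[:, last])
        score = relevance - redundancy_sum / len(selected)
        score[selected] = -np.inf  # never re-select a feature
        selected.append(int(np.argmax(score)))
    return selected

# Hypothetical toy data: the label depends on two independent binary
# features a and b; column 1 is an exact copy of column 0.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=500)
b = rng.integers(0, 2, size=500)
y = a | b
X = np.column_stack([a, a, b])

selected = mrmr_select(X, y, k=2)
print(selected)
```

On this toy data the duplicated column carries no information beyond its twin, so the redundancy penalty should steer the second pick towards the other informative feature. In a pipeline like the one described, the reduced matrix `X[:, selected]` would then be passed to the downstream classifier.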

Paper Structure

This paper contains 8 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: The proposed gene prioritization pipeline.
  • Figure 2: Computational efficiency analysis: cost breakdown per phase (left) and long-term cost evolution across thousands of inferences (right).