Genetic algorithms for multi-omic feature selection: a comparative study in cancer survival analysis

Luca Cattelani, Vittorio Fortino

Abstract

Multi-omic datasets offer opportunities for improved biomarker discovery in cancer research, but their high dimensionality and limited sample sizes make identifying compact and effective biomarker panels challenging. Feature selection in large-scale omics can be efficiently addressed by combining machine learning with genetic algorithms, which naturally support multi-objective optimization of predictive accuracy and biomarker set size. However, genetic algorithms remain relatively underexplored for multi-omic feature selection, where most approaches concatenate all layers into a single feature space. To address this limitation, we introduce Sweeping*, a multi-view, multi-objective algorithm alternating between single- and multi-view optimization. It employs a nested single-view multi-objective optimizer; in this study, that optimizer is the genetic algorithm NSGA3-CHS. Sweeping* first identifies informative biomarkers within each layer, then jointly evaluates cross-layer interactions; these multi-omic solutions guide the next single-view search. Through repeated sweeps, the algorithm progressively identifies compact biomarker panels capturing cross-modal complementary signals. We benchmark five Sweeping* strategies, including hierarchical and concatenation-based variants, using survival prediction on three TCGA cohorts. Each strategy jointly optimizes predictive accuracy and set size, measured via the concordance index and root-leanness. Overall performance and estimation error are assessed through cross hypervolume and Pareto delta under 5-fold cross-validation. Our results show that Sweeping* can improve the accuracy-complexity trade-off when sufficient survival signal is present and that integrating omic layers can enhance survival prediction beyond clinical-only models, although benefits remain cohort-dependent.
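As a reading aid for the accuracy objective named in the abstract, the following is a minimal sketch of Harrell's concordance index for right-censored survival data. It is not taken from the paper: the function name `concordance_index`, its argument names, and the choice to skip pairs with tied times are assumptions made for illustration only.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs ranked correctly.

    times  -- observed follow-up times
    events -- 1 if the event (death) was observed, 0 if censored
    risks  -- predicted risk scores (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Simplifying assumption: skip tied times; a pair is comparable
        # only when the shorter time corresponds to an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # higher risk for the earlier event
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ordered correctly, 0.5 corresponds to uninformative predictions, and 0.0 means every pair is inverted.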

Paper Structure

This paper contains 11 sections, 3 figures.

Figures (3)

  • Figure 1: Cross hypervolume (CHV) and Pareto delta ($P_{\Delta}$) across TCGA cohorts. (a) CHV for each multi-omic feature selection strategy evaluated in this study across TCGA-KIRC, TCGA-LGG, and TCGA-SARC. CHV is a multi-objective (MO) measure of overall performance that can be applied in cross-validation settings. The objectives here are the predictive performance and the biomarker set size (number of selected features). Higher values indicate better overall utility of solution sets. (b) $P_{\Delta}$ measures the discrepancy between training and testing performance of the solutions across the whole approximated Pareto front when the objectives are based on ML performance metrics. $P_{\Delta}$ quantifies overestimation, with lower values indicating smaller train–test gaps and more robust performance. Error bars represent variability across cross-validation folds.
  • Figure 2: Accuracy-complexity trade-offs across TCGA cohorts. C-index as a function of the number of selected features for each multi-omic feature selection strategy in TCGA-KIRC, TCGA-LGG, and TCGA-SARC. Each curve represents the best average performance achieved across cross-validation folds at a given feature count, illustrating the empirical trade-off between predictive accuracy and biomarker set size. Strategies include clinical-only models, direct concatenation of omic layers, and sweeping-based multi-view (MV) optimization variants. The figure highlights how different optimization designs navigate the accuracy–sparsity spectrum under distinct survival data regimes.
  • Figure 3: Expected versus measured C-index for clinical-only and multi-omic feature selection strategies in TCGA-KIRC and TCGA-LGG. Top panels show C-index as a function of the number of selected features. “Expected” points represent internal cross-validation performance, while “measured” points denote testing performance on the left-out sets. The results from all 5 folds are plotted together. The left column corresponds to clinical-only optimization, and the right column to the multi-omic sweeping strategy integrating clinical with expression-based features. Bottom panels display the distribution of selected feature counts required to achieve increasing concordance levels, illustrating model complexity and composition. They show the counts and the expected C-index obtained by optimizing on the whole dataset. The comparison highlights cohort-dependent differences in incremental molecular value and generalization stability.
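To make the hypervolume idea behind Figure 1(a) concrete, here is a minimal sketch of the standard two-objective hypervolume of a solution set. It is not the cross-validated CHV reported in the paper, whose exact definition is not given here. The names `pareto_front` and `hypervolume_2d`, the reference point `(0, 0)`, and the use of a generic "leanness" score in [0, 1] as the second objective (a stand-in, not the paper's root-leanness) are all assumptions for illustration.

```python
def pareto_front(points):
    """Return the non-dominated points when both objectives are maximized."""
    front = []
    # Visit points by accuracy (descending); a point survives only if its
    # leanness beats every point already kept, all of which are at least
    # as accurate.
    for acc, lean in sorted(set(points), key=lambda p: (-p[0], -p[1])):
        if not front or lean > front[-1][1]:
            front.append((acc, lean))
    return front


def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by the front and bounded below by `ref`.

    Assumes every point is at least as good as `ref` on both objectives.
    """
    area = 0.0
    prev_lean = ref[1]
    # Along the front, accuracy decreases while leanness increases, so the
    # dominated region splits into horizontal strips.
    for acc, lean in pareto_front(points):
        area += (acc - ref[0]) * (lean - prev_lean)
        prev_lean = lean
    return area


# Hypothetical (C-index, leanness) pairs; the third point is dominated.
solutions = [(0.8, 0.2), (0.6, 0.5), (0.5, 0.4)]
front = pareto_front(solutions)
volume = hypervolume_2d(solutions)
```

A larger area means the solution set covers more of the accuracy-leanness space, which matches the caption's reading that higher values indicate better overall utility.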