Dual-stage optimizer for systematic overestimation adjustment applied to multi-objective genetic algorithms for biomarker selection

Luca Cattelani, Vittorio Fortino

TL;DR

This work tackles the winner's curse in multi-objective (MO) biomarker feature selection under high feature dimensionality and limited samples. It introduces DOSA-MO, a dual-stage MO optimization framework that learns to predict overestimation from training performance, its variability, and feature-set size, and then uses the learned predictions to adjust the objectives in a second optimization pass. The authors also present two metrics, MOPE and Pareto delta, to quantify estimation error and solution-level deviations between train and external test performance. Across kidney and breast cancer transcriptomics with external validation, DOSA-MO improves model selection and biomarker discovery, with tree-based adjusters often delivering the strongest reduction in overestimation, and demonstrates broader applicability to MO problems beyond this domain.

Abstract

The challenge in biomarker discovery using machine learning from omics data lies in the abundance of molecular features but scarcity of samples. Most feature selection methods in machine learning require evaluating various sets of features (models) to determine the most effective combination. This process, typically conducted using a validation dataset, involves testing different feature sets to optimize the model's performance. Every evaluation carries a performance estimation error, and when the selection involves many models, the best ones are almost certainly overestimated. Biomarker identification with feature selection methods can be addressed as a multi-objective problem with trade-offs between predictive ability and parsimony in the number of features. Genetic algorithms are a popular tool for multi-objective optimization, but because they evolve numerous solutions, they are prone to overestimation. Methods have been proposed to reduce the overestimation after a model has already been selected in single-objective problems, but no existing algorithm could reduce the overestimation during the optimization, improve model selection, or be applied in the more general multi-objective domain. We propose DOSA-MO, a novel multi-objective optimization wrapper algorithm that learns how the original estimation, its variance, and the feature set size of the solutions predict the overestimation. DOSA-MO adjusts the expectation of the performance during the optimization, improving the composition of the solution set. We verify that DOSA-MO improves the performance of a state-of-the-art genetic algorithm on left-out or external sample sets, when predicting cancer subtypes and/or patient overall survival, using three transcriptomics datasets for kidney and breast cancer.
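
The overestimation effect described in the abstract can be illustrated with a small simulation. This is a minimal sketch and not part of the paper: it assumes every candidate model has the same true score and that each evaluation adds independent Gaussian noise. The function name `simulate_winners_curse` and all parameter values are hypothetical.

```python
import random
import statistics


def simulate_winners_curse(n_models, true_score=0.70, noise_sd=0.05,
                           n_trials=500, seed=0):
    """Mean gap between the best estimated score and the true score.

    Assumption: all candidates share the same true score, so any gap
    comes only from picking the maximum of noisy estimates.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_trials):
        # One noisy performance estimate per candidate model.
        estimates = [true_score + rng.gauss(0.0, noise_sd)
                     for _ in range(n_models)]
        # The selected model is the one that merely looks best.
        gaps.append(max(estimates) - true_score)
    return statistics.mean(gaps)


single_gap = simulate_winners_curse(n_models=1)
many_gap = simulate_winners_curse(n_models=200)
print(f"mean overestimation, 1 model evaluated:    {single_gap:+.4f}")
print(f"mean overestimation, 200 models evaluated: {many_gap:+.4f}")
```

With a single evaluated model the estimate is unbiased on average, while selecting the best of many candidates yields a systematically positive gap, even though no candidate is actually better than the others.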

Paper Structure

This paper contains 14 sections, 2 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Sequence of operations for external validation of DOSA-MO. An MO problem is defined with multiple objectives, e.g. cancer subtype classification, prediction of survival, and parsimony in the feature set size. A dataset (e.g. TCGA breast omics data) is fed to DOSA-MO, whose optimization process has 3 steps. In step 1 it performs a k-fold cross-validation (CV) with a wrapped MO optimizer (e.g. the NSGA3-CHS genetic algorithm) and collects the solutions from all the folds. For each solution and objective, a sample is constructed. Its independent variables are the fitness expected by the wrapped MO optimizer, the standard deviation (SD) of that fitness, and the feature set size; its dependent variable is the overestimation (expected fitness minus fitness assessed on the left-out set); and its sample weight is the partial derivative of the hypervolume (HV) with respect to this fitness measurement. In step 2 these samples are used to train regression models for the overestimation, and new adjusted objective functions are created. In step 3 a wrapped MO optimizer is run with the adjusted objective functions, which changes how the models are selected. The solution set is the output of DOSA-MO (e.g. a set of biomarkers). It might be beneficial to use a faster wrapped optimizer in step 1 than in step 3, since step 1 uses k-fold CV. An external dataset (e.g. SCAN-B) is used to externally validate the solution set.
  • Figure 2: MultiObjectiveOptimizer abstract class definition.
  • Figure 3: DosaMO class definition.
  • Figure 4: trainAdjuster function definition.
  • Figure 5: Scatter plots depicting solutions from external validation on breast cancer transcriptomics data using SVM as inner model. The MO optimization targets balanced accuracy for subtype prediction and root-leanness. For simplicity, the horizontal axis shows the number of features. Three performance values are shown for each solution: the performance measured in the inner CV (i.e. the performance expected by the optimizer), the performance of the model trained on the TCGA breast set and tested on the same set, and the performance of the same model on the external SCAN-B set. The lines are interpolating splines. (a) Using the unadjusted optimizer. (b) Using SVR as regression model for fitness adjustment. (c) Using RFReg as regression model for fitness adjustment.
  • ...and 3 more figures
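
Steps 1 and 2 from the Figure 1 caption can be sketched for a single objective as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the dictionary keys, the function names, and the synthetic data are hypothetical, and a weighted linear least-squares fit is swapped in for the regression models named in the figure captions (SVR, RFReg) to keep the example short. Step 3, rerunning the wrapped optimizer with the adjusted objective, is omitted.

```python
import numpy as np


def build_samples(solutions):
    """Turn per-fold solutions into regression samples (hypothetical keys).

    Independent variables: expected fitness, its SD, feature set size.
    Dependent variable: overestimation = expected - left-out fitness.
    """
    X = np.array([[s["expected"], s["sd"], s["n_features"]]
                  for s in solutions], dtype=float)
    y = np.array([s["expected"] - s["left_out"] for s in solutions],
                 dtype=float)
    w = np.array([s["weight"] for s in solutions], dtype=float)
    return X, y, w


def fit_overestimation_model(X, y, w):
    """Weighted least squares; stands in for the paper's regressors."""
    A = np.hstack([np.ones((len(X), 1)), X])  # add intercept column
    sw = np.sqrt(w)  # scaling rows by sqrt(w) applies the sample weights
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)

    def predict(expected, sd, n_features):
        return float(coef @ np.array([1.0, expected, sd, n_features]))

    return predict


def adjusted_fitness(predict, expected, sd, n_features):
    """Adjusted objective: expected fitness minus predicted overestimation."""
    return expected - predict(expected, sd, n_features)


# Synthetic solutions (assumption): overestimation grows with SD and size.
rng = np.random.default_rng(0)
toy = []
for _ in range(40):
    expected = rng.uniform(0.6, 0.9)
    sd = rng.uniform(0.01, 0.05)
    n_features = int(rng.integers(1, 50))
    over = 0.02 + 0.5 * sd + 0.001 * n_features
    toy.append({"expected": expected, "sd": sd, "n_features": n_features,
                "left_out": expected - over,
                "weight": rng.uniform(0.1, 1.0)})  # placeholder for HV derivative

X, y, w = build_samples(toy)
predict = fit_overestimation_model(X, y, w)
print(adjusted_fitness(predict, expected=0.80, sd=0.03, n_features=10))
```

In this toy setting the overestimation is exactly linear in the inputs, so the fit recovers it; on real data the relationship is unknown, which is presumably why nonlinear regressors are compared in the paper.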