BoMGene: Integrating Boruta-mRMR feature selection for enhanced Gene expression classification
Bich-Chung Phan, Thanh Ma, Huu-Hoa Nguyen, Thanh-Nghi Do
TL;DR
BoMGene tackles the high dimensionality of gene expression classification by integrating two complementary feature-selection strategies: mRMR for global relevance and redundancy reduction, and Boruta for local, interaction-aware refinement using shadow features. The two-stage pipeline substantially reduces the number of features while maintaining or improving classification accuracy across four learners (SVM, RF, XGBoost, GBM) on 25 public gene expression datasets, and it trains faster and more stably than baseline methods. The approach shows strong practical value for multi-class gene expression classification, enabling efficient model deployment and potential biomarker discovery. Future work aims to improve the underlying relevance and redundancy measures and to incorporate data augmentation to further mitigate overfitting and class imbalance.
Abstract
Feature selection is a crucial step in analyzing gene expression data: it enhances classification performance and reduces computational costs for high-dimensional datasets. This paper proposes BoMGene, a hybrid feature selection method that integrates two popular techniques: Boruta and Minimum Redundancy Maximum Relevance (mRMR). The method aims to optimize the feature space and enhance classification accuracy. Experiments were conducted on 25 publicly available gene expression datasets, employing widely used classifiers such as Support Vector Machine (SVM), Random Forest (RF), XGBoost (XGB), and Gradient Boosting Machine (GBM). The results show that the Boruta-mRMR combination selects fewer features than mRMR alone, which shortens training time while maintaining or even improving classification accuracy relative to the individual feature selection methods. The proposed approach demonstrates clear advantages in accuracy, stability, and practical applicability for multi-class gene expression data analysis.
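
The two-stage idea summarized above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. It assumes that an mRMR-style stage runs first and a Boruta-style stage then refines its output. To stay self-contained, it substitutes absolute Pearson correlation for both the mutual-information terms of mRMR and the Random Forest importances of Boruta, and it replaces Boruta's statistical test with a simple majority vote over rounds. The function names (`pearson`, `mrmr_select`, `shadow_filter`), the parameters (`k`, `n_rounds`), and the toy data are all hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mrmr_select(columns, y, k):
    """Stage 1 (assumed): greedy mRMR-style selection, difference form.

    Relevance is |corr(feature, y)|; redundancy is the mean |corr| with the
    features already selected. Correlation stands in for mutual information.
    """
    relevance = [abs(pearson(c, y)) for c in columns]
    selected, remaining = [], list(range(len(columns)))

    def score(j):
        if not selected:
            return relevance[j]
        redundancy = sum(abs(pearson(columns[j], columns[s])) for s in selected) / len(selected)
        return relevance[j] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def shadow_filter(columns, y, candidates, n_rounds=20, seed=0):
    """Stage 2 (assumed): Boruta-style shadow-feature check.

    Each round shuffles every candidate column to build "shadow" features and
    records the best shadow importance. A candidate is kept if it beats that
    value in more than half of the rounds (a simplification of Boruta's test).
    """
    rng = random.Random(seed)
    importance = {j: abs(pearson(columns[j], y)) for j in candidates}
    hits = {j: 0 for j in candidates}
    for _ in range(n_rounds):
        shadow_max = 0.0
        for j in candidates:
            shadow = columns[j][:]   # copy, then destroy any link to y
            rng.shuffle(shadow)
            shadow_max = max(shadow_max, abs(pearson(shadow, y)))
        for j in candidates:
            if importance[j] > shadow_max:
                hits[j] += 1
    return [j for j in candidates if hits[j] > n_rounds / 2]


# Toy data: 40 samples, 2 informative (near-duplicate) features, 4 noise features.
rng = random.Random(42)
n = 40
y = [i % 2 for i in range(n)]
x0 = [y[i] + 0.1 * math.sin(i) for i in range(n)]
x1 = [2 * x0[i] + 0.05 * math.cos(3 * i) for i in range(n)]
noise = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(4)]
columns = [x0, x1] + noise

stage1 = mrmr_select(columns, y, k=4)        # coarse, global filter
final = shadow_filter(columns, y, stage1)    # refinement against shadows
print("after mRMR-style stage:", stage1)
print("after shadow filter:", final)
```

In this sketch the second stage can only shrink the candidate set produced by the first, which mirrors the reported effect that the combination selects fewer features than mRMR alone. A realistic version would likely use a mutual-information estimator and tree-based importances instead of correlation.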
