
A Pan-cancer Classification Model using Multi-view Feature Selection Method and Ensemble Classifier

Tareque Mohmud Chowdhury, Farzana Tabassum, Sabrina Islam, Abu Raihan Mostofa Kamal

TL;DR

The study tackles pan-cancer classification from high-dimensional transcriptome data by introducing a multi-view feature selection framework based on partitioned Boruta features and two parallel ensemble classifiers. The method generates twelve ranked feature sets and identifies Rank10 (3515 features) as optimal, achieving 97.11% accuracy and an AUC of 0.9996 on 33 TCGA cancers using average voting (avEns). It also demonstrates strong class-wise performance, particularly across tumors with similar tissue origins, and outperforms methods in the existing literature. GO/KEGG enrichment analyses of the selected features support their biological relevance. The approach offers a scalable, biologically grounded pipeline for high-dimensional cancer classification and can extend to binary or subtype analyses across other omics data.

Abstract

Accurately identifying cancer samples is crucial for precise diagnosis and effective patient treatment. Traditional methods falter on data with high dimensionality and a high feature-to-sample ratio, yet such data are critical for classifying cancer samples. This study develops a novel feature selection framework specifically for transcriptome data and proposes two ensemble classifiers. For feature selection, we partition the transcriptome dataset vertically by feature type, apply the Boruta feature selection process to each partition, combine the results, and apply Boruta again to the combined result. We repeat the process with different Boruta parameters and prepare the final feature set. Finally, we construct two ensemble ML models based on logistic regression (LR), support vector machine (SVM), and XGBoost classifiers, one using max voting and the other averaging predicted probabilities. We use 10-fold cross-validation to ensure robust and reliable classification performance. With 97.11% accuracy and an AUC of 0.9996, our approach outperforms existing state-of-the-art methods in classifying 33 types of cancer. A set of 12 cancer types is traditionally difficult to tell apart because of their similar tissue of origin. Our method correctly identifies over 90% of samples from these 12 cancer types, outperforming all known methods in the existing literature. Gene set enrichment analysis reveals that the features selected by our framework are enriched in pathways highly related to cancer. In summary, the framework selects features highly related to cancer development and identifies different types of cancer samples with higher accuracy.
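
The partition, select, combine, and re-select flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and it is not Boruta itself: real Boruta uses random forest importances and a statistical test over many iterations, whereas this stand-in only borrows the core idea of keeping a feature when it beats shuffled "shadow" copies. The function names, the view names (`type_a`, `type_b`), the gene names, and the two-class toy data are all hypothetical, and the repeated runs with different Boruta parameters are not covered.

```python
import random
from statistics import mean

def class_gap(column, labels):
    """Toy importance score: absolute gap between the two class means."""
    zeros = [v for v, y in zip(column, labels) if y == 0]
    ones = [v for v, y in zip(column, labels) if y == 1]
    return abs(mean(zeros) - mean(ones))

def shadow_select(data, labels, names, rng, rounds=20):
    """Keep features that beat the best shuffled (shadow) copy in most rounds."""
    hits = {name: 0 for name in names}
    for _ in range(rounds):
        shadow_best = 0.0
        for name in names:
            shadow = data[name][:]      # copy, then destroy the label link
            rng.shuffle(shadow)
            shadow_best = max(shadow_best, class_gap(shadow, labels))
        for name in names:
            if class_gap(data[name], labels) > shadow_best:
                hits[name] += 1
    return [name for name in names if hits[name] > rounds / 2]

def multi_view_select(data, labels, views, seed=0):
    """Select within each view, pool the survivors, then select once more."""
    rng = random.Random(seed)
    combined = []
    for view_features in views.values():
        combined.extend(shadow_select(data, labels, view_features, rng))
    return shadow_select(data, labels, combined, rng)

# Hypothetical toy data: 20 samples, two classes, four features in two views.
labels = [0] * 10 + [1] * 10
data = {
    "g1": [0.0] * 10 + [10.0] * 10,  # separates the classes
    "g2": [1.0, 2.0] * 10,           # noise: same mean in both classes
    "g3": [5.0] * 10 + [-5.0] * 10,  # separates the classes
    "g4": [3.0, 4.0] * 10,           # noise: same mean in both classes
}
views = {"type_a": ["g1", "g2"], "type_b": ["g3", "g4"]}
selected = multi_view_select(data, labels, views)
print(selected)
```

Selecting inside each view first keeps every run small relative to the full feature space, and the second pass over the pooled survivors lets features from different views compete with each other.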

Paper Structure (14 sections, 7 equations, 5 figures, 9 tables, 1 algorithm)


Figures (5)

  • Figure 1: Boruta feature selection process in brief.
  • Figure 2: Flow diagram of the proposed feature selection approach and its performance evaluation.
  • Figure 3: Framework of the proposed ensemble models. a) In max voting fusion, the class label with the most votes is predicted. b) In average voting, the final prediction is derived from the average of the class probabilities predicted by the component classifiers.
  • Figure 4: Area under the ROC curve for classification of 12 tumor types with 10-fold cross-validation by the average voting ensemble (avEns) model. Classifiers generally struggle to identify these 12 tumors accurately because they share similar tissues of origin.
  • Figure 5: A confusion matrix showing class-wise performance of the proposed averaging ensemble model (avEns) on the Rank10 selected feature set (3515 features). The confusion matrix was produced by averaging the results of 10-fold cross-validation for 33 tumor classes.
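
The two fusion rules named in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the probability rows are made-up stand-ins for the outputs of the LR, SVM, and XGBoost components, and the tie-breaking behaviour (first class encountered wins) is an assumption the source does not specify.

```python
from collections import Counter

def max_voting(prob_rows):
    """Each classifier votes for its most probable class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in prob_rows]
    return Counter(votes).most_common(1)[0][0]

def average_voting(prob_rows):
    """Average the class probabilities across classifiers, then take the top class."""
    n_classes = len(prob_rows[0])
    avg = [sum(p[c] for p in prob_rows) / len(prob_rows) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

# Hypothetical class probabilities for one sample over three classes.
probs = [
    [0.50, 0.45, 0.05],  # stand-in for LR: narrowly prefers class 0
    [0.48, 0.47, 0.05],  # stand-in for SVM: narrowly prefers class 0
    [0.05, 0.90, 0.05],  # stand-in for XGBoost: strongly prefers class 1
]
print(max_voting(probs), average_voting(probs))
```

In this toy case the two rules disagree: max voting returns class 0 because two of three components vote for it, while average voting returns class 1 because one confident component outweighs two narrow preferences.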