A Comparative Analysis of Gene Expression Profiling by Statistical and Machine Learning Approaches

Myriam Bontonou, Anaïs Haget, Maria Boulougouri, Benjamin Audit, Pierre Borgnat, Jean-Michel Arbona

TL;DR

The paper investigates whether machine-learning–derived explanations of gene-expression profiles yield biologically meaningful biomarkers beyond traditional differential-expression analyses. It systematically compares explanations from integrated gradients applied to logistic regression (LR), multilayer perceptron (MLP), and graph neural network (GNN) classifiers with classical methods such as EdgeR, DESeq2, and mutual information, across diverse cancer and healthy-tissue datasets. The findings show that small gene sets can achieve strong classification, but the top-ranked genes differ substantially by method, and statistical differential-expression methods often perform as well or better with fewer genes. Over-representation analyses reveal that different methods point to distinct biological processes, and ML-based explanations are not universally more interpretable, underscoring the need to study pathway-level signatures and cellular processes for robust biomarker discovery.

Abstract

Many machine learning models have been proposed to classify phenotypes from gene expression data. In addition to their good performance, these models can potentially provide some understanding of phenotypes by extracting explanations for their decisions. These explanations often take the form of a list of genes ranked in order of importance for the predictions, the highest-ranked genes being interpreted as linked to the phenotype. We discuss the biological and the methodological limitations of such explanations. Experiments are performed on several datasets gathering cancer and healthy tissue samples from the TCGA, GTEx and TARGET databases. A collection of machine learning models, including logistic regression, multilayer perceptron, and graph neural network, is trained to classify samples according to their cancer type. Gene rankings are obtained from explainability methods adapted to these models, and compared to the ones from classical statistical feature selection methods such as mutual information, DESeq2, and EdgeR. Interestingly, on simple tasks, we observe that the information learned by black-box neural networks is related to the notion of differential expression. In all cases, a small set containing the best-ranked genes is sufficient to achieve a good classification. However, these genes differ significantly between the methods, and similar classification performance can be achieved with numerous lower-ranked genes. In conclusion, although these methods enable the identification of biomarkers characteristic of certain pathologies, our results question the completeness of the selected gene sets and thus of explainability by the identification of the underlying biological processes.
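
The abstract notes that the best-ranked genes differ significantly between methods. One simple way to quantify such disagreement is the percentage of genes shared by two top-k lists. The sketch below is only an illustration of that idea, not the authors' code: the function names, the toy scores, and the convention that a larger absolute score means a more important gene are all assumptions.

```python
import numpy as np

def top_k_genes(scores, k):
    """Indices of the k genes with the largest absolute importance score."""
    # Assumption: a larger |score| means a more important gene.
    order = np.argsort(-np.abs(scores), kind="stable")
    return set(order[:k].tolist())

def top_k_overlap(scores_a, scores_b, k):
    """Percentage of genes shared by the two top-k lists."""
    shared = top_k_genes(scores_a, k) & top_k_genes(scores_b, k)
    return 100.0 * len(shared) / k

# Made-up scores for six genes from two hypothetical ranking methods.
method_a = np.array([0.90, 0.10, 0.80, 0.05, 0.70, 0.20])
method_b = np.array([0.85, 0.60, 0.10, 0.02, 0.75, 0.30])

overlap = top_k_overlap(method_a, method_b, k=3)
print(f"top-3 overlap: {overlap:.1f}%")
```

Computing this value for every pair of methods would give a matrix that can be displayed as a heatmap; real scores would come from the explainability or statistical method of interest.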

Paper Structure

This paper contains 35 sections, 10 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Illustration of PGI and PGU: prediction gaps on important (PGI) and unimportant (PGU) features, for an example $\mathbf{x}$ in a class $c$.
  • Figure 2: Heatmaps showing the percentage of common genes among the top 10 (lower) and top 100 (upper + diagonal) genes selected by each method. More details are in Materials and methods.
  • Figure 3: Impact of progressive gene masking on the predictions of ML models (experiment 0). Genes are masked in increasing (PGU) or decreasing (PGI) order of importance based on the rankings $\boldsymbol{\phi}^\text{IG}$. For each data sample, PGU calculates the percentage of well-ranked genes that should remain unmasked to avoid disturbing a trained model. $100 - \text{PGI}$ estimates the percentage of well-ranked genes that can be masked before disturbing the model. PGs are averaged over all correctly classified training samples. Error bars are standard deviations across replicates.
  • Figure 4: Classification performance of models trained on features identified as important (solid lines, experiment 1) or unimportant (dashed lines, experiment 2). Balanced accuracies are reported as a function of the number of kept features for the ttg-breast (a) and BRCA-pam (b) datasets using the specified models. Error bars are standard deviations across 10 replicates.
  • Figure 5: Classification performance of an MLP trained on sets of features identified as important by various methods as indicated. The representation is coded as in Fig. 4.
  • ...and 6 more figures
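
The prediction gaps named in the captions of Figures 1 and 3 can be illustrated with a small sketch. This is a generic version of the idea, not the paper's exact definition (which is given in its Materials and methods and may differ, for instance in being expressed as a percentage of genes). The assumptions here are: masking a gene means setting it to a baseline of zero, the gap is the drop in the predicted probability of class $c$ clipped at zero, and the gap is averaged over the masking steps. The toy linear softmax model, the weights `W`, and the stand-in importance scores are all hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

def prediction_gap(predict, x, c, order, baseline=0.0):
    """Mean drop in p(c | x) while masking features one by one in `order`.

    Assumption: masking replaces a feature value with `baseline`.
    """
    p_ref = predict(x)[c]
    x_masked = x.copy()
    gaps = []
    for j in order:
        x_masked[j] = baseline
        gaps.append(max(0.0, p_ref - predict(x_masked)[c]))
    return float(np.mean(gaps))

# Hypothetical two-class linear model over four "genes".
W = np.array([[3.0, 0.1, 0.0, -0.1],
              [-3.0, 0.0, 0.1, 0.1]])

def predict(x):
    return softmax(W @ x)

x = np.ones(4)
c = 0

# Stand-in importance scores; a real analysis would use an attribution
# method such as integrated gradients instead.
scores = np.abs(W[c] * x)
most_important_first = np.argsort(-scores, kind="stable")

# PGI: mask the most important features first.
pgi = prediction_gap(predict, x, c, most_important_first)
# PGU: mask the least important features first.
pgu = prediction_gap(predict, x, c, most_important_first[::-1])
print(f"PGI = {pgi:.3f}, PGU = {pgu:.3f}")
```

If a ranking is informative for the model, masking important genes first should disturb the prediction much more than masking unimportant ones first, so the first gap should be clearly larger than the second, as it is in this toy case.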