Domain-Adversarial Neural Network and Explainable AI for Reducing Tissue-of-Origin Signal in Pan-cancer Mortality Classification

Cristian Padron-Manrique, Juan José Oropeza Valdez, Osbaldo Resendis-Antonio

TL;DR

The paper tackles the challenge that tissue-of-origin signals dominate pan-cancer survival analyses, hindering the discovery of universal mortality biomarkers. It introduces a Domain-Adversarial Neural Network (DANN) trained on TCGA RNA-seq data to learn tissue-invariant representations focused on mortality, complemented by layer-aware SHAP analyses and SHAP-guided clustering to reveal pan-cancer survival subpopulations. Results show that the DANN reduces tissue bias in its representations, but standard input-space SHAP explanations remain tissue-dominated. SHAP-based explanations computed at hidden layers, by contrast, uncover clearer survival-relevant structure and enable identification of five prognostic gene clusters across cancers. The approach demonstrates the value of combining domain adaptation with layer-wise interpretability to isolate mortality signals from tissue-of-origin signals, enabling more robust pan-cancer biomarker discovery and interpretable patient stratification. Collectively, this framework advances generalizable survival predictions across tumor types and provides a roadmap for layer-aware XAI in high-dimensional, multi-domain biomedical data.

Abstract

Tissue-of-origin signals dominate pan-cancer gene expression, often obscuring molecular features linked to patient survival. This hampers the discovery of generalizable biomarkers, as models tend to overfit tissue-specific patterns rather than capture survival-relevant signals. To address this, we propose a Domain-Adversarial Neural Network (DANN) trained on TCGA RNA-seq data to learn representations less biased by tissue and more focused on survival. Identifying tissue-independent genetic profiles is key to revealing core cancer programs. We assess the DANN using two approaches: (1) standard SHAP, based on the original input space and the DANN's mortality classifier; and (2) a layer-aware strategy applied to hidden activations, including an unsupervised manifold from raw activations and a supervised manifold from mortality-specific SHAP values. Standard SHAP remains confounded by tissue signals due to biases inherent in its computation. The raw activation manifold is dominated by high-magnitude activations, which mask subtle tissue and mortality-related signals. In contrast, the layer-aware SHAP manifold offers improved low-dimensional representations of both tissue and mortality signals, independent of activation strength, enabling subpopulation stratification and pan-cancer identification of survival-associated genes.
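
The domain-adversarial idea in the abstract can be illustrated with a small NumPy sketch of a gradient reversal layer, the mechanism commonly used in DANNs. This is not the authors' implementation: the linear feature extractor, the binary "domain" label (the paper's domains are TCGA cancer types), the synthetic data, and all names such as `GradientReversal` and `domain_loss_and_feature_grad` are assumptions made for illustration. The layer is the identity in the forward pass and flips the sign of the gradient in the backward pass, so an ordinary descent step on the feature extractor makes the domain harder to predict.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y):
    """Mean binary cross-entropy between probabilities p and labels y."""
    eps = 1e-12
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

class GradientReversal:
    """Identity forward; multiplies the incoming gradient by -lam backward."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        return features

    def backward(self, grad_wrt_output):
        return -self.lam * grad_wrt_output

def domain_loss_and_feature_grad(W, v, X, d, grl):
    """Domain loss, plus the gradient that reaches W through the reversal layer."""
    feats = X @ W                          # (n, h) hypothetical linear feature extractor
    z = grl.forward(feats) @ v             # (n,) domain logits
    p = sigmoid(z)
    loss = bce(p, d)
    grad_z = (p - d) / len(d)              # dL/dz for sigmoid + mean BCE
    grad_feats = np.outer(grad_z, v)       # dL/dfeats, shape (n, h)
    grad_feats = grl.backward(grad_feats)  # sign flipped here
    grad_W = X.T @ grad_feats              # (g, h) gradient seen by the extractor
    return loss, grad_W

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))               # synthetic "expression" matrix
d = (X[:, 0] > 0).astype(float)            # synthetic binary domain label
W = rng.normal(scale=0.1, size=(5, 3))     # feature-extractor weights
v = rng.normal(size=3)                     # domain-classifier weights (held fixed)
grl = GradientReversal(lam=1.0)

loss_before, grad_W = domain_loss_and_feature_grad(W, v, X, d, grl)
W_new = W - 0.1 * grad_W                   # plain descent step on the reversed gradient
loss_after, _ = domain_loss_and_feature_grad(W_new, v, X, d, grl)

print("domain loss before:", loss_before)
print("domain loss after: ", loss_after)
```

In a full DANN the domain classifier is updated with the unreversed gradient while the label (here, mortality) classifier trains the same features in the usual way; the sketch isolates only the reversal step.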

Paper Structure

This paper contains 28 sections, 10 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Tissue-of-origin patterns dominate unsupervised transcriptomic projections. UMAP applied to transcriptomic profiles separates samples into tissue-of-origin clusters (left), delineating the cancer types. For vital status (right), the same projection shows no meaningful class separation between alive (red) and dead (blue) samples, highlighting the challenge of separating survival status with unsupervised methods.
  • Figure 2: Training and validation performance of the DANN model over 499 epochs. Top-left: Training and cross-validation loss for the label classifier. Top-right: Training and cross-validation loss for the domain classifier. Bottom-left: Training and cross-validation accuracy for the label classifier. Bottom-right: Training and cross-validation accuracy for the domain classifier. The shaded regions in each plot represent the standard deviation across cross-validation folds.
  • Figure 3: Normalized clustering scores reveal domain and survival structure changes during training. Normalized clustering scores (Calinski-Harabasz and Silhouette) computed on the 2D UMAP projections of layer activations from the DANN architecture over training epochs. Top panels reflect clustering quality for vital status (alive vs dead), while bottom panels show clustering based on domain labels (TCGA cancer types). Each line corresponds to a specific transformation layer within the model: feature_extractor.dropout1 (red), label_predictor.dropout2 (green), and domain_classifier.dropout2 (blue), which represent the last transformation in each respective path. LOWESS smoothing reveals a consistent decline in domain-related clustering—particularly pronounced in the label predictor pathway—while clustering by vital status becomes progressively stronger. This illustrates the model’s ability to suppress domain bias and enhance survival-relevant structure during training.
  • Figure 4: 2D UMAP projections of hidden activations reveal temporal evolution across DANN layers. 2D UMAP visualizations of hidden layer activations from three distinct Domain-Adversarial Neural Network (DANN) layers across selected training epochs: 1, 5, 10, 50, 70, 100, 300, and 499. The top three rows display the UMAP projections colored by vital status (red = alive, blue = dead), while the bottom three rows show the same projections colored by TCGA cancer type. Each row corresponds to a different dropout layer in the DANN architecture: feature_extractor.dropout1, label_predictor.dropout2, and domain_classifier.dropout2.
  • Figure 5: Normalized clustering scores from SHAP-based UMAP projections reveal survival-relevant structure. Normalized clustering scores (Calinski-Harabasz and Silhouette) computed on the 2D UMAP projections of SHAP values extracted from an XGBoost model trained on activations from different layers of the DANN architecture across training epochs. The top panels show clustering quality based on vital status (alive vs dead), while the bottom panels reflect clustering for domain labels (TCGA cancer types). SHAP values were computed for each layer separately and then projected using UMAP into two dimensions. Each line represents a different layer: feature_extractor.dropout1 (red), label_predictor.dropout2 (green), and domain_classifier.dropout2 (blue). By applying LOWESS smoothing, we observe a progressive increase in clustering structure related to survival outcomes, while clustering by cancer type diminishes, especially in the label predictor path. These trends reflect how interpretability patterns captured by SHAP align with the model’s learning dynamics, revealing disentanglement from tissue-of-origin signals and growing emphasis on features relevant to mortality. This type of SHAP-based clustering evaluation could also be used as a criterion to determine an optimal stopping point during model training.
  • ...and 7 more figures
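
The clustering scores named in the captions of Figures 3 and 5 (Calinski-Harabasz and Silhouette, computed on 2D projections under two labelings) can be sketched in plain NumPy. The example below is an illustration under stated assumptions, not the paper's pipeline: the 2D points are synthetic stand-ins for a UMAP projection, the `tissue` and `status` labels are simulated, the function names are hypothetical, and the normalization applied in the paper is not reproduced because the captions do not specify it.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Ratio of between-cluster to within-cluster dispersion (higher = tighter clusters)."""
    n = len(X)
    classes = np.unique(labels)
    k = len(classes)
    overall = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in classes:
        pts = X[labels == c]
        center = pts.mean(axis=0)
        between += len(pts) * np.sum((center - overall) ** 2)
        within += np.sum((pts - center) ** 2)
    return float((between / (k - 1)) / (within / (n - k)))

def silhouette(X, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    classes = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton cluster: score left at 0 by convention
        a = dist[i, same].sum() / (n_same - 1)  # self-distance is 0
        b = min(dist[i, labels == c].mean() for c in classes if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Synthetic 2D "projection": points group by tissue, not by vital status.
rng = np.random.default_rng(0)
tissue = np.repeat([0, 1, 2], 50)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = centers[tissue] + rng.normal(size=(150, 2))
status = rng.integers(0, 2, size=150)      # labels unrelated to position

ch_tissue, ch_status = calinski_harabasz(X, tissue), calinski_harabasz(X, status)
sil_tissue, sil_status = silhouette(X, tissue), silhouette(X, status)

print("Calinski-Harabasz  tissue:", ch_tissue, " status:", ch_status)
print("Silhouette         tissue:", sil_tissue, " status:", sil_status)
```

On data like this, both scores are high for the tissue labeling and near their floor for the status labeling, which is the situation Figure 1 describes for raw transcriptomic projections. Tracking the same two scores per epoch and per layer gives curves of the kind summarized in Figures 3 and 5.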