Adversarial Domain Adaptation Enables Knowledge Transfer Across Heterogeneous RNA-Seq Datasets

Kevin Dradjat, Massinissa Hamidi, Blaise Hanczar

TL;DR

This study proposes a deep learning-based domain adaptation framework that enables effective knowledge transfer from a large general dataset to a smaller one for cancer type classification, and demonstrates consistent improvements in cancer and tissue type classification accuracy compared to non-adaptive baselines.

Abstract

Accurate phenotype prediction from RNA sequencing (RNA-seq) data is essential for diagnosis, biomarker discovery, and personalized medicine. Deep learning models have demonstrated strong potential to outperform classical machine learning approaches, but their performance relies on large, well-annotated datasets. In transcriptomics, such datasets are frequently limited, leading to overfitting and poor generalization. Knowledge transfer from larger, more general datasets can alleviate this issue. However, transferring information across RNA-seq datasets remains challenging due to heterogeneous preprocessing pipelines and differences in target phenotypes. In this study, we propose a deep learning-based domain adaptation framework that enables effective knowledge transfer from a large general dataset to a smaller one for cancer type classification. The method learns a domain-invariant latent space by jointly optimizing classification and domain alignment objectives. To ensure stable training and robustness in data-scarce scenarios, the framework is trained adversarially with appropriate regularization. Both supervised and unsupervised variants are explored, leveraging labeled or unlabeled target samples, respectively. The framework is evaluated on three large-scale transcriptomic datasets (TCGA, ARCHS4, GTEx) to assess its ability to transfer knowledge across cohorts. Experimental results demonstrate consistent improvements in cancer and tissue type classification accuracy compared to non-adaptive baselines, particularly in low-data scenarios. Overall, this work highlights domain adaptation as a powerful strategy for data-efficient knowledge transfer in transcriptomics, enabling robust phenotype prediction under constrained data conditions.
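
The joint optimization of classification and domain alignment described in the abstract can be illustrated with the standard adversarial domain adaptation objective. This is a generic sketch, not the paper's own equation, which is not reproduced here. The loss names $\mathcal{L}_{\mathrm{cls}}$ and $\mathcal{L}_{\mathrm{dom}}$ and the trade-off weight $\lambda$ are assumptions introduced for illustration; $E$, $C$ and $D$ stand for an encoder, a phenotype classifier and a domain discriminator.

$$
\min_{E,\,C}\;\max_{D}\;\; \mathcal{L}_{\mathrm{cls}}(E, C) \;-\; \lambda\, \mathcal{L}_{\mathrm{dom}}(E, D)
$$

Under this formulation, $D$ is trained to reduce $\mathcal{L}_{\mathrm{dom}}$ (telling source from target samples), while $E$ is trained to increase it, which pushes both domains toward a shared latent representation that remains useful for classification.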

Paper Structure

This paper contains 22 sections, 1 equation, 5 figures, 1 table.

Figures (5)

  • Figure 1: Conceptual illustration of domain adaptation process.
  • Figure 2: Composition of the domain adaptation pipeline. $E$, $C$ and $D$ denote, respectively, the encoder, the phenotype classifier and the domain discriminator. The unsupervised variant is defined by only the solid arrows, whereas we add the dotted arrows for the supervised variant. The discriminator loss can either be the Wasserstein distance or cross-entropy.
  • Figure 3: UMAP visualisations after domain adaptation for two study cases: (a) TCGA-target and (b) GTEx-target. Top row shows embeddings coloured by domain and bottom row shows embeddings coloured by class. (Top) TCGA dataset, (Bottom) GTEx dataset.
  • Figure 4: Performance of each adaptation method with the TCGA dataset as target (a) and the GTEx dataset as target (b).
  • Figure 5: Test target accuracy for different proportions of source-domain training samples for (a) TCGA and (b) GTEx targets. Each model was trained with an increasing proportion of source samples while keeping the labeled target proportion fixed at $0.01$ ($\sim$100 examples).
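
The pipeline composition in the Figure 2 caption (encoder $E$, phenotype classifier $C$, domain discriminator $D$, with either a cross-entropy or a Wasserstein discriminator loss) can be sketched as a forward pass in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the single-layer networks, the layer sizes, the weight `lam`, the synthetic data, and all function names are hypothetical, and the regularization a Wasserstein critic needs in practice (for example weight clipping or a gradient penalty) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_linear(n_in, n_out):
    """Small random weights and zero bias for one linear layer (illustrative)."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

def encoder(x, params):
    """E: map expression profiles to a latent space (linear layer + ReLU)."""
    w, b = params
    return np.maximum(x @ w + b, 0.0)

def classifier_loss(z, y, params):
    """C: mean softmax cross-entropy of phenotype predictions against labels y."""
    w, b = params
    logits = z @ w + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

def discriminator_loss(z_src, z_tgt, params, kind="cross_entropy"):
    """D: the loss the discriminator minimises to separate source from target."""
    w, b = params
    s_src = (z_src @ w + b).ravel()
    s_tgt = (z_tgt @ w + b).ravel()
    if kind == "wasserstein":
        # The critic maximises mean(s_src) - mean(s_tgt), so it minimises the negative.
        return -(s_src.mean() - s_tgt.mean())
    # Binary cross-entropy with source labelled 1 and target labelled 0:
    # -log sigmoid(s) = logaddexp(0, -s) and -log(1 - sigmoid(s)) = logaddexp(0, s).
    return np.logaddexp(0.0, -s_src).mean() + np.logaddexp(0.0, s_tgt).mean()

def encoder_objective(x_src, y_src, x_tgt, params, lam=0.1, y_tgt=None,
                      kind="cross_entropy"):
    """Loss minimised by E and C: classification loss minus lam * discriminator loss."""
    z_src = encoder(x_src, params["E"])
    z_tgt = encoder(x_tgt, params["E"])
    cls = classifier_loss(z_src, y_src, params["C"])
    if y_tgt is not None:
        # Supervised variant: labelled target samples also reach the classifier.
        cls = cls + classifier_loss(z_tgt, y_tgt, params["C"])
    disc = discriminator_loss(z_src, z_tgt, params["D"], kind)
    return cls - lam * disc, cls, disc

# Synthetic stand-ins for a large source cohort and a small, shifted target cohort.
n_genes, n_latent, n_classes = 50, 8, 3
params = {
    "E": init_linear(n_genes, n_latent),
    "C": init_linear(n_latent, n_classes),
    "D": init_linear(n_latent, 1),
}
x_src = rng.normal(size=(64, n_genes))
y_src = rng.integers(0, n_classes, size=64)
x_tgt = rng.normal(loc=0.5, size=(16, n_genes))
y_tgt = rng.integers(0, n_classes, size=16)

total_unsup, cls_unsup, disc_ce = encoder_objective(x_src, y_src, x_tgt, params)
total_sup, cls_sup, _ = encoder_objective(x_src, y_src, x_tgt, params, y_tgt=y_tgt)
total_w, _, disc_w = encoder_objective(x_src, y_src, x_tgt, params, kind="wasserstein")
```

In a full training loop, the discriminator would be updated to lower its own loss while the encoder and classifier are updated to lower the combined objective, typically in alternating steps or through a gradient reversal layer. Which of these schemes the paper uses is not stated in the text above.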