Disentangled Multi-modal Learning of Histology and Transcriptomics for Cancer Characterization

Yupei Zhang, Xiaofei Wang, Anran Liu, Lequan Yu, Chao Li

TL;DR

A disentangled multi-modal framework with four contributions, the first of which mitigates multi-modal heterogeneity by decomposing WSIs and transcriptomes into tumor and microenvironment subspaces with a disentangled multi-modal fusion module and by introducing a confidence-guided gradient coordination strategy to balance subspace optimization.

Abstract

Histopathology remains the gold standard for cancer diagnosis and prognosis. With the advent of transcriptome profiling, multi-modal learning combining transcriptomics with histology offers more comprehensive information. However, existing multi-modal approaches are challenged by intrinsic multi-modal heterogeneity, insufficient multi-scale integration, and reliance on paired data, restricting clinical applicability. To address these challenges, we propose a disentangled multi-modal framework with four contributions: 1) To mitigate multi-modal heterogeneity, we decompose WSIs and transcriptomes into tumor and microenvironment subspaces using a disentangled multi-modal fusion module, and introduce a confidence-guided gradient coordination strategy to balance subspace optimization. 2) To enhance multi-scale integration, we propose an inter-magnification gene-expression consistency strategy that aligns transcriptomic signals across WSI magnifications. 3) To reduce dependency on paired data, we propose a subspace knowledge distillation strategy enabling transcriptome-agnostic inference through a WSI-only student model. 4) To improve inference efficiency, we propose an informative token aggregation module that suppresses WSI redundancy while preserving subspace semantics. Extensive experiments on cancer diagnosis, prognosis, and survival prediction demonstrate our superiority over state-of-the-art methods across multiple settings. Code is available at https://github.com/helenypzhang/Disentangled-Multimodal-Learning.

Paper Structure

This paper contains 35 sections, 17 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Framework overview. Left: Multi-modal inputs (with (a) disentangled transcriptomic profiles and (b) multi-scale WSI embeddings at $10\times$ and $20\times$ magnification), and Multi-scale architecture (with the DEV loss applied across scales in Stage I). Right: Two-stage framework, where Stage I learns subspace-aware multi-modal representations and Stage II performs multi-modal distillation. Only WSIs are required at inference.
  • Figure 2: Findings for model design. (a) Performance on cancer diagnosis with WSI-only, transcriptomic-only, and WSI-transcriptome integration by concatenation. (b) C-index of different input forms of tumor and TME-related genes. (c) Cancer grading accuracy with WSIs at different scales as input, using cross-attention (C-Att) or deformable attention (D-Att).
  • Figure 3: The Confidence-guided Gradient Coordination strategy. The confidence scores $S^T$ and $S^E$ are computed by applying softmax to the subspace logits. The less confident gradient is projected onto the orthogonal complement of the more confident one.
  • Figure 4: The Inter-magnification Gene-Expression Consistency Strategy. A cross-scale similarity matrix is utilized to measure the inter-magnification consistency. The sample-wise inter-magnification consistency is constrained by a Diagonal Element Variance loss.
  • Figure 5: Visualization of feature representation using Ours (teacher) and MCAT in glioma diagnosis on TCGA GBM-LGG datasets. Our teacher exhibits more distinct clustering, particularly in distinguishing low-grade astrocytoma and oligodendroglioma cases, as circled in red.
  • ...and 3 more figures
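
The projection step described in the Figure 3 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names are hypothetical, gradients are treated as flat vectors, and the confidence score is assumed to be the batch mean of the maximum softmax probability (the caption only says that $S^T$ and $S^E$ come from a softmax over the subspace logits). The sketch also projects unconditionally, as the caption states; whether the paper applies the projection only when the two gradients conflict is not specified here.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def confidence(logits):
    # ASSUMPTION: confidence = batch mean of the max softmax probability.
    return float(softmax(logits).max(axis=-1).mean())

def project_orthogonal(g, ref, eps=1e-12):
    # Remove from g its component along ref, leaving the part of g
    # that lies in the orthogonal complement of ref.
    return g - (g @ ref) / (ref @ ref + eps) * ref

def coordinate_gradients(g_t, g_e, logits_t, logits_e):
    # g_t, g_e: flattened gradients of the tumor and microenvironment subspaces.
    # logits_t, logits_e: subspace logits of shape (batch, classes).
    s_t, s_e = confidence(logits_t), confidence(logits_e)
    if s_t >= s_e:
        # Tumor subspace is more confident: keep g_t, project g_e.
        return g_t, project_orthogonal(g_e, g_t)
    # Microenvironment subspace is more confident: keep g_e, project g_t.
    return project_orthogonal(g_t, g_e), g_e

# Toy usage: the tumor logits are sharper, so g_e is the one projected.
g_t = np.array([1.0, 0.0])
g_e = np.array([1.0, 1.0])
logits_t = np.array([[4.0, 0.0]])
logits_e = np.array([[0.5, 0.0]])
new_t, new_e = coordinate_gradients(g_t, g_e, logits_t, logits_e)
```

After the call, `new_e` has no component along `g_t`, so under these assumptions an update along `new_e` does not, to first order, interfere with the more confident subspace's update direction.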