Unified Modeling Enhanced Multimodal Learning for Precision Neuro-Oncology
Huahui Yi, Xiaofei Wang, Kang Li, Chao Li
TL;DR
The paper addresses precision neuro-oncology by fusing histopathology and genomics through Unified Modeling Enhanced Multimodal Learning (UMEML), which employs a hierarchical attention structure to capture both shared and complementary information. The framework comprises two unimodal encoders (pathology and genomics) and a Unified Multimodal Decoder. It is augmented by three components: a query-based cross-attention that clusters pathology patches into prototypes; a prototype assignment with a modularity loss $\mathcal{L}_{\text{modularity}} = -\frac{1}{2e}\big(\alpha \mathrm{Tr}(W (S^p)^T S^p) + \beta \mathrm{Tr}(W (S^g)^T S^g)\big)$, which enters the total loss $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{objective}} + \gamma \mathcal{L}_{\text{modularity}}$; and a registration mechanism with learnable tokens. It reports state-of-the-art results on TCGA GBM-LGG across glioma grading, classification, and survival prediction (e.g., grading Acc 0.7756, AUC 0.9212; classification Acc 0.7514, AUC 0.9594; survival c-index 0.8396). Ablation studies confirm the importance of the modularity loss, the Unified Multimodal Decoder, and the register tokens. This work advances multimodal fusion for precision neuro-oncology and suggests directions for handling missing modalities in future work.
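To make the two loss formulas concrete, the following is a minimal, dependency-free sketch that evaluates them numerically. This summary does not define the shapes of $W$, $S^p$, and $S^g$ or the meaning of $e$, so the sketch assumes that $W$ is a square $m \times m$ matrix, that each assignment matrix has $m$ columns (so that $S^T S$ is $m \times m$), and that $e$ is a positive scalar normalizer. The function names `trace_w_sts`, `modularity_loss`, and `total_loss` and the default weights are hypothetical and are not taken from the paper.

```python
def trace_w_sts(W, S):
    """Return Tr(W S^T S) for W (m x m) and S (r x m), given as nested lists."""
    m = len(W)
    total = 0.0
    for i in range(m):
        for j in range(m):
            # (S^T S)[j][i] = sum over rows r of S[r][j] * S[r][i]
            sts_ji = sum(row[j] * row[i] for row in S)
            # Tr(W M) = sum_i sum_j W[i][j] * M[j][i]
            total += W[i][j] * sts_ji
    return total


def modularity_loss(W, S_p, S_g, e, alpha=1.0, beta=1.0):
    """L_modularity = -(alpha * Tr(W S_p^T S_p) + beta * Tr(W S_g^T S_g)) / (2e)."""
    # Assumption: e is a positive scalar; alpha and beta weight the two modalities.
    weighted = alpha * trace_w_sts(W, S_p) + beta * trace_w_sts(W, S_g)
    return -weighted / (2.0 * e)


def total_loss(objective, modularity, gamma=0.1):
    """L_total = L_objective + gamma * L_modularity."""
    return objective + gamma * modularity


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]      # toy 2 x 2 matrix (assumed shape)
    S_p = [[1.0, 0.0], [0.0, 1.0]]    # toy pathology assignment matrix
    S_g = [[1.0, 1.0]]                # toy genomics assignment matrix
    l_mod = modularity_loss(W, S_p, S_g, e=2.0, alpha=1.0, beta=0.5)
    print(l_mod, total_loss(1.0, l_mod, gamma=0.2))
```

Because the modularity term carries a negative sign, minimizing the total loss increases both trace terms, with $\gamma$ controlling how strongly that pressure competes with the task objective.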
Abstract
Multimodal learning, integrating histology images and genomics, promises to enhance precision oncology with comprehensive views at microscopic and molecular levels. However, existing methods may not sufficiently model the shared or complementary information for more effective integration. In this study, we introduce a Unified Modeling Enhanced Multimodal Learning (UMEML) framework that employs a hierarchical attention structure to effectively leverage shared and complementary features of both modalities, histology and genomics. Specifically, to mitigate unimodal bias from modality imbalance, we utilize a query-based cross-attention mechanism for prototype clustering in the pathology encoder. Our prototype assignment and modularity strategy are designed to align shared features and minimize modality gaps. An additional registration mechanism with learnable tokens is introduced to enhance cross-modal feature integration and robustness in multimodal unified modeling. Our experiments demonstrate that our method surpasses previous state-of-the-art approaches in glioma diagnosis and prognosis tasks, underscoring its superiority in precision neuro-oncology.
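The query-based cross-attention for prototype clustering can be illustrated with a generic sketch in which a small set of query vectors attends over many patch features, so that each query yields one prototype as an attention-weighted average of the patches. This is a sketch of standard scaled dot-product cross-attention, not the authors' implementation: the learned key and value projections are omitted, the queries are fixed rather than learned, and the names `softmax` and `cross_attention_prototypes` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]


def cross_attention_prototypes(queries, patches):
    """Pool n patch vectors into k prototypes, one per query vector.

    queries: k vectors of dimension d (learnable in a real model; fixed here).
    patches: n vectors of dimension d (e.g., pathology patch features).
    Returns (prototypes, attention): k x d prototypes and k x n weights.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    prototypes, attention = [], []
    for q in queries:
        # Scaled dot-product score between this query and every patch.
        scores = [scale * sum(qi * pi for qi, pi in zip(q, p)) for p in patches]
        weights = softmax(scores)
        # Prototype = attention-weighted average of the patch vectors.
        proto = [sum(w * p[j] for w, p in zip(weights, patches)) for j in range(d)]
        attention.append(weights)
        prototypes.append(proto)
    return prototypes, attention


if __name__ == "__main__":
    patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    queries = [[1.0, 0.0], [0.0, 1.0]]
    protos, attn = cross_attention_prototypes(queries, patches)
    print(protos)
    print(attn)
```

Pooling through a fixed number of queries reduces a variable, typically very large, number of patches to a fixed number of prototypes, which is one plausible way such a mechanism could limit the imbalance between the pathology and genomics inputs.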
