HEALNet: Multimodal Fusion for Heterogeneous Biomedical Data

Konstantin Hemker, Nikola Simidjievski, Mateja Jamnik

TL;DR

HEALNet is a flexible multimodal fusion architecture that preserves modality-specific structural information, handles missing modalities during training and inference, and enables intuitive model inspection by learning on the raw data input instead of opaque embeddings.

Abstract

Technological advances in medical data collection, such as high-throughput genomic sequencing and digital high-resolution histopathology, have contributed to the rising requirement for multimodal biomedical modelling, specifically for image, tabular and graph data. Most multimodal deep learning approaches use modality-specific architectures that are often trained separately and cannot capture the crucial cross-modal information that motivates the integration of different data sources. This paper presents the Hybrid Early-fusion Attention Learning Network (HEALNet): a flexible multimodal fusion architecture, which a) preserves modality-specific structural information, b) captures the cross-modal interactions and structural information in a shared latent space, c) can effectively handle missing modalities during training and inference, and d) enables intuitive model inspection by learning on the raw data input instead of opaque embeddings. We conduct multimodal survival analysis on Whole Slide Images and Multi-omic data on four cancer datasets from The Cancer Genome Atlas (TCGA). HEALNet achieves state-of-the-art performance compared to other end-to-end trained fusion models, substantially improving over unimodal and multimodal baselines whilst being robust in scenarios with missing modalities.

Paper Structure (12 sections, 4 equations, 4 figures, 6 tables, 2 algorithms)


Figures (4)

  • Figure 1: Overview of HEALNet (Hybrid Early-fusion Attention Learning Network), which uses a shared and a modality-specific parameter space to learn from structurally different data sources in the same model (Fig. 1A). The shared space is a learned latent embedding $S$ that is iteratively updated through $d$ attention-based fusion layers and captures the information shared between modalities. The hybrid early-fusion layer (Fig. 1B and Eq. \ref{eq:update_function}) learns, for each modality $m$, the cross-attention weights $W_m=\{W^{(q)}_m, W^{(k)}_m, W^{(v)}_m\}$ corresponding to the queries ($Q_m=W^{(q)}_m S$), keys ($K_m=W^{(k)}_m X_m$), and values ($V_m=W^{(v)}_m X_m$); these weights are shared between layers. The fusion layers capture the structural information of each modality and encode it in the shared embedding after a pass through a self-normalising network (SNN) layer.
  • Figure 2: Mean percentage uplift of all multimodal models compared to the best unimodal baseline. Across all tested TCGA cancer sites, HEALNet's hybrid early-fusion paradigm outperforms early, intermediate, and late fusion methods.
  • Figure 3: Illustration of the model's inspection capabilities using HEALNet on a high-risk patient of the UCEC study. We use the mean modality-specific attention weights across layers to highlight high-risk regions and inspect high-attention omic features. Individual patches can be used for further clinical or computational post-hoc analysis such as nucleus segmentation. We observe that the high-risk regions exhibit a very high concentration and a different arrangement of epithelial cells (red), which is commonly associated with the origin of various cancer types (Coradini et al., 2011, "Epithelial cell polarity and tumorigenesis").
  • Figure 4: Effect of the regularisation mechanism. We show the train (top) and validation (bottom) losses on the KIRP dataset for HEALNet variants with no regularisation (blue), only L1 regularisation (indigo), and L1 regularisation plus a self-normalising network layer (green).
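
The update described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`fusion_step`, `healnet_like_forward`, `init_modality_weights`), the residual connections, the per-modality SNN weight matrix, the random initialisation, the toy shapes, and the choice to skip a modality passed as `None` are all hypothetical. It uses a row-vector convention, so `S @ W_q` plays the role of $Q_m=W^{(q)}_m S$. The shared latent $S$ supplies the queries, each modality $X_m$ supplies keys and values, the same weights are reused at every layer, and the per-layer attention maps are averaged as in the Figure 3 caption.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    # Activation used by self-normalising networks (SNNs).
    return scale * np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def init_modality_weights(rng, latent_dim, input_dim, attn_dim, std=0.1):
    # Hypothetical initialisation; one weight set per modality m.
    return {
        "q": rng.normal(0, std, (latent_dim, attn_dim)),    # queries from S
        "k": rng.normal(0, std, (input_dim, attn_dim)),     # keys from X_m
        "v": rng.normal(0, std, (input_dim, latent_dim)),   # values from X_m
        "snn": rng.normal(0, std, (latent_dim, latent_dim)),
    }

def fusion_step(S, X_m, W):
    Q = S @ W["q"]                                # (n_latents, attn_dim)
    K = X_m @ W["k"]                              # (n_tokens, attn_dim)
    V = X_m @ W["v"]                              # (n_tokens, latent_dim)
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (n_latents, n_tokens)
    S = S + A @ V                                 # assumed residual update of the latent
    S = S + selu(S @ W["snn"])                    # assumed residual SNN pass
    return S, A

def healnet_like_forward(S, inputs, weights, depth):
    attn = {m: [] for m in inputs}
    for _ in range(depth):                        # d fusion layers
        for m, X_m in inputs.items():
            if X_m is None:                       # assumption: skip a missing modality
                continue
            S, A = fusion_step(S, X_m, weights[m])  # same weights at every layer
            attn[m].append(A)
    # Mean attention per modality across layers, usable for inspection.
    mean_attn = {m: np.mean(a, axis=0) for m, a in attn.items() if a}
    return S, mean_attn

# Toy example: 10 image patches with 16 features, 5 omic features as 3-dim tokens.
rng = np.random.default_rng(0)
S0 = rng.normal(size=(4, 8))
inputs = {"wsi": rng.normal(size=(10, 16)), "omic": rng.normal(size=(5, 3))}
weights = {m: init_modality_weights(rng, 8, X.shape[1], 6) for m, X in inputs.items()}
S, mean_attn = healnet_like_forward(S0, inputs, weights, depth=3)
```

Because the latent $S$ has a fixed shape regardless of which modalities are present, passing `{"wsi": ..., "omic": None}` to the same function still yields a valid embedding; this is one plausible way to read the abstract's claim about missing modalities, not a confirmed detail of the method.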