Multi-modal Learning with Missing Modality via Shared-Specific Feature Modelling

Hu Wang; Yuanhong Chen; Congbo Ma; Jodie Avery; Louise Hull; Gustavo Carneiro

TL;DR

ShaSpec tackles missing modalities in multi-modal learning by jointly learning shared and modality-specific representations through distribution alignment and a domain-classification objective, fused via residual connections. The architecture supports both segmentation and classification, and training handles both full and partial modality settings. Empirically, ShaSpec achieves state-of-the-art performance on BraTS2018 segmentation and Audiovision-MNIST classification, with notable gains when modalities are missing and when using dedicated training. The approach is simple, parameter-efficient, and readily adaptable to new tasks and datasets, offering practical impact for robust multimodal inference in real-world scenarios.

Abstract

The missing-modality problem is critical but non-trivial for multi-modal models to solve. Current methods for handling missing modalities in multi-modal tasks either deal with missing modalities only during evaluation or train separate models for specific missing-modality settings. In addition, these models are designed for specific tasks; for example, classification models are not easily adapted to segmentation tasks, and vice versa. In this paper, we propose the Shared-Specific Feature Modelling (ShaSpec) method, which is considerably simpler and more effective than competing approaches that address the issues above. ShaSpec is designed to take advantage of all available input modalities during training and evaluation by learning shared and specific features to better represent the input data. This is achieved through a strategy that relies on auxiliary tasks based on distribution alignment and domain classification, in addition to a residual feature fusion procedure. The design simplicity of ShaSpec also enables its easy adaptation to multiple tasks, such as classification and segmentation. Experiments are conducted on both medical image segmentation and computer vision classification, with results indicating that ShaSpec outperforms competing methods by a large margin. For instance, on BraTS2018, ShaSpec improves the SOTA by more than 3% for enhancing tumour, 5% for tumour core and 3% for whole tumour. The code is available at https://github.com/billhhh/ShaSpec/.
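The two auxiliary tasks named in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the ShaSpec repository and rests on several assumptions: the domain-classification term is written as a cross-entropy loss that predicts the source modality from each specific feature, the distribution-alignment term is replaced by a simple pairwise L1 distance between shared features as a stand-in, and the weights `alpha` and `beta` (named after the symbols in the Figure 4 caption) are assigned to the alignment and classification terms respectively. All function names and default values are hypothetical.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy for integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def domain_classification_loss(specific_feats, w_cls):
    """Predict which modality each specific feature came from.

    specific_feats: list of (batch, feat_dim) arrays, one per modality.
    w_cls: (feat_dim, n_modalities) linear classifier (hypothetical).
    """
    logits = np.concatenate(specific_feats, axis=0) @ w_cls
    labels = np.concatenate(
        [np.full(len(s), i) for i, s in enumerate(specific_feats)]
    )
    return softmax_cross_entropy(logits, labels)


def distribution_alignment_loss(shared_feats):
    """Stand-in alignment term (assumption): mean pairwise L1 distance."""
    n = len(shared_feats)
    dists = [
        np.abs(shared_feats[i] - shared_feats[j]).mean()
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return float(np.mean(dists)) if dists else 0.0


def total_loss(task_loss, shared_feats, specific_feats, w_cls,
               alpha=0.1, beta=0.1):
    """Task loss plus weighted auxiliary terms (weight assignment assumed)."""
    return (
        task_loss
        + alpha * distribution_alignment_loss(shared_feats)
        + beta * domain_classification_loss(specific_feats, w_cls)
    )
```

In this sketch the classification term pushes specific features to stay distinguishable by modality, while the alignment term pulls shared features from different modalities together; the relative strength of the two is what `alpha` and `beta` control.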

Paper Structure (18 sections, 9 equations, 5 figures, 4 tables)

Introduction
Related Work
Multi-modal Learning Models
Addressing Missing Modality in Multi-modal Learning
Methodology
Overall Architecture
Evaluation with Full and Missing Modalities
Training with Full and Missing Modalities
Domain Classification Objective
Distribution Alignment Objective
Overall Objective
Experiments
Datasets
Implementation Details
Segmentation Results
...and 3 more sections

Figures (5)

Figure 1: Full-modality training and evaluation of ShaSpec. All modalities $\{\mathbf{x}^{(i)}\}_{i=1}^{N} \in \mathcal{M}$ are passed through one shared encoder and individual specific encoders to produce the shared features $\{\mathbf{r}^{(i)}\}_{i=1}^{N}$ and specific features $\{\mathbf{s}^{(i)}\}_{i=1}^{N}$, respectively. Then, in a residual learning manner, the shared and specific features are fused with a linear projection $f_{\theta^{\text{proj}}}(\cdot)$ to get the fused features $\{\mathbf{f}^{(i)}\}_{i=1}^{N}$ for decoding. The dashed blue arrows indicate different objective functions.
Figure 2: Missing-modality training and evaluation of ShaSpec. Without loss of generality, we assume $\mathbf{x}^{(n)}$ is missing, where $n$ can be $1, 2, ..., N$. For the available modalities $\mathbf{x}^{(1)}, ..., \mathbf{x}^{(n-1)}, \mathbf{x}^{(n+1)}, ..., \mathbf{x}^{(N)}$, the shared-specific fused features $\mathbf{f}^{(1)}, ..., \mathbf{f}^{(n-1)}, \mathbf{f}^{(n+1)}, ..., \mathbf{f}^{(N)}$ are extracted in the same way as in the full-modality case. For the missing modality $\mathbf{x}^{(n)}$, however, the fused features $\mathbf{f}^{(n)}$ are generated from the available shared features $\mathbf{r}^{(1)}, ..., \mathbf{r}^{(n-1)}, \mathbf{r}^{(n+1)}, ..., \mathbf{r}^{(N)}$ via a missing-modality feature generation process. The dashed blue arrows indicate different objective functions.
Figure 3: Model performance with varying rates of missing modality data (for both image and audio) on Audiovision-MNIST.
Figure 4: Sensitivity to $\alpha$ and $\beta$ in the overall objective (Eq. \ref{eq:main_loss}) under non-dedicated training, where only T1 is available for evaluation on BraTS2018.
Figure 5: t-SNE visualisation of shared and specific features of four modalities from all training data on BraTS2018. The shared features of four modalities are presented by 'x' in different colours, while the specific features of four modalities are presented by 'o' in different colours.
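
The forward passes described in the captions of Figures 1 and 2 can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' implementation: single linear layers stand in for the encoders, the residual fusion is assumed to take the form "project the concatenated shared and specific features, then add the shared features back", and the feature for a missing modality is assumed to be the mean of the available shared features. The dimensions and all names (`W_shared`, `W_specific`, `W_proj`, `fuse`, `forward`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MODALITIES = 4  # e.g. four MRI sequences (hypothetical choice)
IN_DIM = 16       # toy input size (assumption)
FEAT_DIM = 8      # toy feature size (assumption)

# One shared encoder (same weights for every modality) and one specific
# encoder per modality. Linear maps stand in for real networks.
W_shared = rng.normal(size=(IN_DIM, FEAT_DIM))
W_specific = [rng.normal(size=(IN_DIM, FEAT_DIM)) for _ in range(N_MODALITIES)]
# Linear projection applied to the concatenated [shared, specific] features.
W_proj = rng.normal(size=(2 * FEAT_DIM, FEAT_DIM))


def fuse(r, s):
    """Residual fusion (assumed form): project [r, s], then add r back."""
    return np.concatenate([r, s], axis=-1) @ W_proj + r


def forward(inputs):
    """Return one fused feature per modality.

    inputs: list of length N_MODALITIES holding (batch, IN_DIM) arrays,
    with None marking a missing modality.
    """
    available = [i for i, x in enumerate(inputs) if x is not None]
    if not available:
        raise ValueError("at least one modality must be available")

    shared = {i: inputs[i] @ W_shared for i in available}
    # Assumption: a missing modality's feature is generated as the mean
    # of the shared features of the available modalities.
    generated = np.mean([shared[i] for i in available], axis=0)

    fused = []
    for i in range(N_MODALITIES):
        if i in shared:
            s = inputs[i] @ W_specific[i]
            fused.append(fuse(shared[i], s))
        else:
            fused.append(generated)
    return fused


# Usage: full-modality input, then the same input with modality 1 missing.
x = [rng.normal(size=(2, IN_DIM)) for _ in range(N_MODALITIES)]
full = forward(x)
partial = forward([x[0], None, x[2], x[3]])
```

Because the decoder always receives one feature per modality, the same model can in principle be used unchanged for full and partial inputs, which is the property the two figures emphasise.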
