A Unified Model for Longitudinal Multi-Modal Multi-View Prediction with Missingness

Boqi Chen; Junier Oliva; Marc Niethammer

TL;DR

The paper tackles the prediction of future clinical outcomes from longitudinal, multi-modal medical records in which data are missing across timepoints and views. It introduces a unified model with separate view encoders, a masked attention-based summarizer, and a transformer decoder that can process arbitrary input histories without imputation, using learnable [SUM] and [PAD] embeddings and a mask $\mathcal{M}$ to handle absent views. Evaluated on the Osteoarthritis Initiative (OAI) dataset for WOMAC pain and Kellgren-Lawrence grade prediction, the approach achieves competitive performance against view-specific baselines, gains from longer temporal histories, and accommodates varying view combinations. The work also provides post-hoc analyses of view importance, highlighting knee radiographs and cartilage thickness maps as key contributors for different tasks, and demonstrates the practical impact of flexible, missingness-tolerant, multi-view modeling on real-world clinical data. The OAI data handling and the use of a transformer decoder to integrate longitudinal information make the method broadly applicable to other longitudinal multi-modal medical prediction tasks.
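To make the masked summarization step concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single attention head, identity key/value projections, and a boolean list `present` playing the role of the mask $\mathcal{M}$; the function name `masked_attention_summary` and the toy feature vectors are hypothetical. A query standing in for the learnable [SUM] embedding attends over the per-view features, and masked views receive zero attention weight, so no imputation is needed.

```python
import math

def masked_attention_summary(sum_query, view_features, present):
    """Summarize view features with one attention query, skipping masked views.

    sum_query:     list[float], stands in for the learnable [SUM] embedding (assumption).
    view_features: list of list[float], one encoded feature vector per view.
    present:       list[bool], True where the view is available (role of the mask M).
    """
    if not any(present):
        raise ValueError("at least one view must be present")
    d = len(sum_query)
    # Scaled dot-product scores; masked views get -inf so softmax gives them zero weight.
    scores = [
        sum(q * k for q, k in zip(sum_query, feat)) / math.sqrt(d) if ok else -math.inf
        for feat, ok in zip(view_features, present)
    ]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the view features is the summarized representation.
    summary = [sum(w * feat[i] for w, feat in zip(weights, view_features)) for i in range(d)]
    return summary, weights

# Toy example: three views, the third one is missing for this patient.
feats = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
summary, weights = masked_attention_summary([1.0, 1.0], feats, [True, True, False])
```

In this toy call the two available views score equally, so they share the attention weight and the missing view contributes nothing to the summary, whatever values sit in its slot.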

Abstract

Medical records often consist of different modalities, such as images, text, and tabular information. Integrating all modalities offers a holistic view of a patient's condition, while analyzing them longitudinally provides a better understanding of disease progression. However, real-world longitudinal medical records present challenges: 1) patients may lack some or all of the data for a specific timepoint, and 2) certain modalities or views might be absent for all patients during a particular period. In this work, we introduce a unified model for longitudinal multi-modal multi-view prediction with missingness. Our method allows as many timepoints as desired for input, and aims to leverage all available data, regardless of their availability. We conduct extensive experiments on the knee osteoarthritis dataset from the Osteoarthritis Initiative for pain and Kellgren-Lawrence grade prediction at a future timepoint. We demonstrate the effectiveness of our method by comparing results from our unified model to specific models that use the same modality and view combinations during training and evaluation. We also show the benefit of having extended temporal data and provide post-hoc analysis for a deeper understanding of each modality/view's importance for different tasks.

Paper Structure (17 sections, 4 equations, 5 figures, 3 tables)

Introduction
Related Works
Method
Feature Extraction
Feature Summarization
Longitudinally-Aware Prediction
Experimental Results
Dataset
Data Preprocessing
Network Training
Results
Conclusion
Acknowledgements
OAI Data
Tabular Data
...and 2 more sections

Figures (5)

Figure 1: Our proposed model consists of an encoder for each modality and view, an attention block for summarizing the features, and a decoder block that predicts the result at each timepoint, focusing solely on previous data. [SUM] and [PAD] are learnable embeddings, where [SUM] outputs the summarized feature of all inputs, and [PAD] represents the modality or view that is absent for all patients.
Figure 2: Comparison of average precision scores between view-specific models and the models obtained via modality dropout from our unified model. The y-axis represents the combination of different views, e.g., TCKP represents using tabular, cartilage thickness maps, knee radiography, and pelvis radiography.
Figure 3: Visualization of the most influential view for pain and KLG prediction. Left: Percentage of data where the view is deemed the most influential. Right: Normalized heatmaps showing the most influential view for each class.
Figure 4: Data availability across modalities and views in the OAI dataset, up to $72m$. The $96m$ data points are excluded as they were not utilized as input. Pelvis data is only available at $0m$ and $48m$.
Figure 5: Comparison between our unified model and view-specific model on AUC ROC. Similar to AP, our unified model performs on par with view-specific models but is slightly better when all views are used.
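The Figure 1 caption says the decoder predicts at each timepoint while focusing solely on previous data. One common way to obtain this behaviour is a causal attention mask, sketched below under stated assumptions: the helper name `attention_mask` is hypothetical, the prediction at position t is assumed to see input timepoints up to and including t, and timepoints with no data at all are assumed to be masked out. The paper's exact masking scheme may differ.

```python
def attention_mask(timepoint_available):
    """Build a causal mask that also hides timepoints with no data.

    timepoint_available: list[bool], True if the patient has any data at that timepoint.
    Returns mask[t][s], True when the prediction at position t may attend to input s.
    """
    n = len(timepoint_available)
    # s <= t enforces causality (assumption: position t may see its own input);
    # timepoint_available[s] drops visits where the patient has no data.
    return [[s <= t and timepoint_available[s] for s in range(n)] for t in range(n)]

# Toy example: three visits, the second one entirely missing.
mask = attention_mask([True, False, True])
for row in mask:
    print(row)
```

Because the mask is built from the length of the input history, the same function covers any number of input timepoints, which matches the abstract's goal of accepting as many timepoints as desired.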
