TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data

Siyi Du, Shaoming Zheng, Yinsong Wang, Wenjia Bai, Declan P. O'Regan, Chen Qin

TL;DR

TIP introduces a transformer-based tabular-image pre-training framework that explicitly handles incomplete tabular data through a versatile tabular encoder and a cross-modal interaction module. It optimizes three self-supervised objectives—image-tabular contrastive learning, image-tabular matching, and masked tabular reconstruction—to learn robust multimodal representations under missing data. Evaluated on UK Biobank cardiac data and the DVM car dataset, TIP achieves state-of-the-art performance in both complete and incomplete data regimes, outperforming MMCL and other baselines, with strong robustness to various missingness patterns. The work demonstrates that jointly leveraging image and tabular information via SSL can significantly enhance downstream multimodal classification tasks in real-world settings.

Abstract

Images and structured tables are essential parts of real-world databases. Though tabular-image representation learning is promising to create new insights, it remains a challenging task, as tabular data is typically heterogeneous and incomplete, presenting significant modality disparities with images. Earlier works have mainly focused on simple modality fusion strategies in complete data scenarios, without considering the missing data issue, and thus are limited in practice. In this paper, we propose TIP, a novel tabular-image pre-training framework for learning multimodal representations robust to incomplete tabular data. Specifically, TIP investigates a novel self-supervised learning (SSL) strategy, including a masked tabular reconstruction task for tackling data missingness, and image-tabular matching and contrastive learning objectives to capture multimodal information. Moreover, TIP proposes a versatile tabular encoder tailored for incomplete, heterogeneous tabular data and a multimodal interaction module for inter-modality representation learning. Experiments are performed on downstream multimodal classification tasks using both natural and medical image datasets. The results show that TIP outperforms state-of-the-art supervised/SSL image/multimodal algorithms in both complete and incomplete data scenarios. Our code is available at https://github.com/siyi-wind/TIP.
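As an illustration of the image-tabular contrastive objective mentioned in the abstract, the following is a minimal sketch of a symmetric InfoNCE-style loss in plain Python. It is not taken from the TIP code base: the function names, the temperature value, and the use of cosine similarity are assumptions chosen for illustration, and the actual formulation of $\mathcal{L}_{itc}$ in the paper may differ.

```python
import math


def _normalize(vec):
    # Scale a vector to unit length (guard against an all-zero vector).
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    # Mean of -log softmax(row)[i] over rows, i.e. the i-th column is
    # treated as the correct "class" for the i-th row.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(image_emb, tabular_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Hypothetical helper: image_emb[i] and tabular_emb[i] are assumed to
    come from the same sample; all other pairs in the batch are negatives.
    """
    img = [_normalize(v) for v in image_emb]
    tab = [_normalize(v) for v in tabular_emb]
    # Cosine similarity matrix (images as rows), scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in tab]
           for i in img]
    sim_t = [list(col) for col in zip(*sim)]  # tabular rows as queries
    # Average the image-to-tabular and tabular-to-image directions.
    return 0.5 * (_diagonal_cross_entropy(sim) + _diagonal_cross_entropy(sim_t))
```

With correctly paired embeddings the loss is close to zero, while swapping the tabular rows within a batch makes it large, which is the behavior a contrastive objective relies on to pull matching image and tabular representations together.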

Paper Structure

This paper contains 10 sections, 3 equations, 12 figures, and 10 tables.

Figures (12)

  • Figure 1: The pipeline of TIP, which is pre-trained on large multimodal datasets (a) and can be deployed to downstream tasks with data missingness (b), e.g., Value Missingness (red) and Feature Missingness (yellow). Results for coronary artery disease classification (c) show TIP's superior performance over the SOTA multimodal pre-training method (numbers denote performance increase). Complete results are shown in Figure 4.
  • Figure 2: Model architecture and algorithm of TIP: (a) Model overview with its image encoder, tabular encoder, and multimodal interaction module, which are pre-trained using 3 SSL losses: $\mathcal{L}_{itc}$, $\mathcal{L}_{itm}$, and $\mathcal{L}_{mtr}$. (b) Model details for (b-1) $\mathcal{L}_{itm}$ and $\mathcal{L}_{mtr}$ calculation and (b-2) tabular embedding with missing data. (c) Pre-training algorithm.
  • Figure 3: Comparison of results with supervised/SSL image/multimodal methods across varying numbers of fine-tuning samples. * denotes full fine-tuning, and [regular] means linear probing.
  • Figure 4: Results for 4 missingness scenarios: (a) RVM, (b) RFM, (c) MIFM, and (d) LIFM, on DVM (1st row), CAD (2nd row), and Infarction (3rd row) tasks with different missing rates $\sigma$. * denotes full fine-tuning, and [regular] means linear probing.
  • Figure 5: The [CLS] token's attention scores to tabular features for a particular class from the last layer of TIP's tabular encoder.
  • ...and 7 more figures
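
The difference between value missingness and feature missingness named in the Figure 1 and Figure 4 captions can be illustrated with a small simulation. The sketch below assumes that RVM and RFM stand for random value and random feature missingness, and that $\sigma$ is the fraction of cells or columns removed. The function names and the masking procedure are hypothetical and are not the authors' evaluation protocol.

```python
import random


def random_value_missingness(rows, rate, seed=0):
    """Return a boolean mask where each cell is missing independently
    with probability `rate` (assumed reading of RVM)."""
    rng = random.Random(seed)
    return [[rng.random() < rate for _ in row] for row in rows]


def random_feature_missingness(rows, rate, seed=0):
    """Return a boolean mask where the same randomly chosen columns are
    missing for every row (assumed reading of RFM)."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    n_missing = round(rate * n_features)
    missing_cols = set(rng.sample(range(n_features), n_missing))
    return [[j in missing_cols for j in range(n_features)] for _ in rows]


def apply_mask(rows, mask, placeholder=None):
    """Replace masked cells with a placeholder, leaving other cells intact."""
    return [[placeholder if missing else value
             for value, missing in zip(row, mask_row)]
            for row, mask_row in zip(rows, mask)]


if __name__ == "__main__":
    table = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
    print(apply_mask(table, random_value_missingness(table, 0.3)))
    print(apply_mask(table, random_feature_missingness(table, 0.5)))
```

Under value missingness, different rows lose different cells, whereas under feature missingness entire columns are absent for every sample. A model evaluated under both therefore has to cope with scattered gaps as well as features that are never observed at test time.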