Code and Pixels: Multi-Modal Contrastive Pre-training for Enhanced Tabular Data Analysis

Kankana Roy; Lars Krämer; Sebastian Domaschke; Malik Haris; Roland Aydin; Fabian Isensee; Martin Held

TL;DR

This work introduces MT-CMTM, a multi-task, multi-modal pre-training framework that uses Masked Tabular Modeling and image-tabular contrastive learning to enrich a tabular encoder (1D-ResNet-CBAM) while enabling deployment with tabular data alone. By pre-training on paired image+tabular data from HIPMP and DVM, MT-CMTM achieves superior downstream performance in both regression and classification tasks, outperforming strong tabular baselines and maintaining robustness in low-data scenarios. Key contributions include the HIPMP dataset, the 1D-ResNet-CBAM tabular encoder, and comprehensive ablations and explainability analyses that elucidate how multi-task signals improve generalization. The results demonstrate the practical value of leveraging auxiliary imaging information during pre-training to enhance tabular data analysis across domains, with potential applicability to other cost-constrained multimodal settings.

Abstract

Tabular data is a rich source of structured information that complements image and video analysis and is often critical for comprehensive understanding and decision-making. We present Multi-task Contrastive Masked Tabular Modeling (MT-CMTM), a novel method that enhances tabular models by leveraging the correlation between tabular data and corresponding images. MT-CMTM employs a dual strategy that combines contrastive learning with masked tabular modeling to exploit the synergy between the two modalities. Central to our approach is a 1D Convolutional Neural Network with residual connections and an attention mechanism (1D-ResNet-CBAM), designed to efficiently process tabular data without relying on images. This enables MT-CMTM to handle purely tabular data for downstream tasks, eliminating the need for potentially costly image acquisition and processing. We evaluated MT-CMTM on the DVM car dataset, which is uniquely suited to this scenario, and on the newly developed HIPMP dataset, which connects membrane fabrication parameters with image data. MT-CMTM outperforms the proposed tabular 1D-ResNet-CBAM trained from scratch, achieving a 1.48% relative improvement in relative MSE on HIPMP and a 2.38% increase in absolute accuracy on DVM. These results demonstrate MT-CMTM's robustness and its potential to advance the field of multi-modal learning.
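To make the dual strategy concrete, the following is a minimal sketch of how a contrastive image-tabular term and a masked-reconstruction term might be combined into one pre-training objective. It is not the authors' implementation: the symmetric InfoNCE form, the `temperature` value, the `weight` mixing parameter, and all function names are assumptions chosen for illustration, since the abstract does not specify them.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(img_emb, tab_emb, temperature=0.1):
    """Symmetric InfoNCE: pair i of image and tabular embeddings is the positive.

    Assumed form; the temperature value is hypothetical.
    """
    img = [l2_normalize(v) for v in img_emb]
    tab = [l2_normalize(v) for v in tab_emb]
    logits = [
        [sum(a * b for a, b in zip(i_vec, t_vec)) / temperature for t_vec in tab]
        for i_vec in img
    ]
    logits_t = [list(col) for col in zip(*logits)]  # tabular-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


def masked_mse(pred, target, mask):
    """Mean squared error over hidden cells only (mask value 1 means hidden)."""
    squared_error, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                squared_error += (p - t) ** 2
                count += 1
    return squared_error / count if count else 0.0


def multitask_loss(img_emb, tab_emb, pred, target, mask,
                   weight=0.5, temperature=0.1):
    """Weighted sum of the two pretext losses (the weighting scheme is assumed)."""
    return (weight * contrastive_loss(img_emb, tab_emb, temperature)
            + (1.0 - weight) * masked_mse(pred, target, mask))
```

In this sketch only the tabular encoder's outputs (`tab_emb` and `pred`) would come from the shared tabular model, which is consistent with the claim that images are needed during pre-training but not at deployment.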

Paper Structure (22 sections, 6 equations, 5 figures, 5 tables)

Introduction
Related Work
Self-supervised Learning in Tabular Data
Multi-modal Contrastive Learning
Multi-task Learning for Self-supervision
Preliminaries
Masked Tabular Modeling (MTM)
Multi-modal Contrastive Learning (MM-CL)
Multi-task Contrastive Masked Tabular Modeling (MT-CMTM)
Experiments and Results
Hereon Isoporous Polymer Membrane Production (HIPMP) Dataset
Data Visual Marketing (DVM) Dataset
Implementation Details
Metrics
Results and Comparison with the Related Work
...and 7 more sections

Figures (5)

Figure 1: Connecting two pretext tasks used in self-supervised tabular pre-training: (a) Masked Tabular Data Modeling (MTM) conducts a mask-and-predict pretext task. (b) Multi-modal contrastive learning (MM-CL) follows a modality comparison paradigm. (c) Multi-task Contrastive Masked Tabular Modeling (MT-CMTM) introduces a combined contrastive and masked pretext scheme.
Figure 2: Incorporating self-supervised pretext tasks and multi-modal contrastive learning in a multi-task framework. Notably, the unimodal tabular encoder (highlighted in red) exhibits substantial performance improvements when shared between two distinct pre-training strategies.
Figure 3: Effect of the number of convolution blocks on the performance of our proposed model 1D-ResNet-CBAM on the tasks of (a) DVM car model prediction from images and (b) membrane quality metric prediction.
Figure 4: Effect of the number of samples on the performance of our proposed MT-CMTM model versus tabular models on the tasks of (a) DVM car model prediction from images and (b) membrane quality metric prediction.
Figure 5: Determining the influence of input features through the application of the SHAP method.
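The mask-and-predict pretext task named in the Figure 1(a) caption can be illustrated with a short sketch of the corruption step. The captions do not state how masked cells are filled, so the scheme below (replacing a hidden cell with a value sampled from the same column) is an assumption borrowed from common practice in self-supervised tabular learning; `mask_rows`, `mask_ratio`, and `seed` are hypothetical names.

```python
import random


def mask_rows(rows, mask_ratio=0.3, seed=0):
    """Corrupt a table for a mask-and-predict pretext task.

    Each cell is hidden with probability `mask_ratio` and replaced by a value
    drawn from the same column. Returns the corrupted table and a 0/1 mask
    marking which cells were hidden (the reconstruction targets).
    """
    rng = random.Random(seed)
    columns = list(zip(*rows))  # column-wise view used for resampling
    corrupted, mask = [], []
    for row in rows:
        new_row, mask_row = [], []
        for j, value in enumerate(row):
            hide = rng.random() < mask_ratio
            new_row.append(rng.choice(columns[j]) if hide else value)
            mask_row.append(1 if hide else 0)
        corrupted.append(new_row)
        mask.append(mask_row)
    return corrupted, mask


if __name__ == "__main__":
    table = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
    corrupted, mask = mask_rows(table, mask_ratio=0.5, seed=1)
    print(corrupted)
    print(mask)
```

A model trained on this task receives `corrupted` as input and is asked to recover the original values at the positions where `mask` is 1.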
