LANISTR: Multimodal Learning from Structured and Unstructured Data

Sayna Ebrahimi, Sercan O. Arik, Yihe Dong, Tomas Pfister

TL;DR

The paper tackles multimodal learning when structured data (tabular/time-series) coexists with unstructured data (language, image) and modalities are frequently missing. It introduces LANISTR, a framework with modality-specific encoders and a cross-attention fusion module, pretrained with unimodal masking losses plus a novel similarity-based multimodal masking loss that handles missing modalities. LANISTR shows substantial improvements on two real-world datasets (MIMIC-IV and Amazon Product Review), reaching an AUROC of up to 87.37% with minimal labeled data and remaining robust when many samples lack some modalities. The work advances practical multimodal learning by leveraging large-scale unlabeled data to provide strong, transferable representations for healthcare and retail tasks, with code and models to be released.

Abstract

Multimodal large-scale pretraining has shown impressive performance for unstructured data such as language and image. However, a prevalent real-world scenario involves structured data types, tabular and time-series, along with unstructured data. Such scenarios have been understudied. To bridge this gap, we propose LANISTR, an attention-based framework to learn from LANguage, Image, and STRuctured data. The core of LANISTR's methodology is rooted in masking-based training applied across both unimodal and multimodal levels. In particular, we introduce a new similarity-based multimodal masking loss that enables it to learn cross-modal relations from large-scale multimodal data with missing modalities. On two real-world datasets, MIMIC-IV (from healthcare) and Amazon Product Review (from retail), LANISTR demonstrates remarkable improvements of 6.6% (in AUROC) and 14% (in accuracy) when fine-tuned with 0.1% and 0.01% of labeled data, respectively, compared to the state-of-the-art alternatives. Notably, these improvements are observed even with a very high ratio of samples (35.7% and 99.8%, respectively) not containing all modalities, underlining the robustness of LANISTR to the practical missing-modality challenge. Our code and models will be available at https://github.com/google-research/lanistr
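To make the idea of masking-based training on data with missing modalities more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `mask_present_modalities`, the dictionary layout of a sample, the fixed `mask_ratio`, and the use of `0.0` as a stand-in for a mask token are all assumptions made for illustration. The only point it shows is that masking is applied to the modalities that are present, while a missing modality simply stays missing.

```python
import random
from typing import Dict, List, Optional

# Hypothetical sample layout: modality name -> feature list, or None if missing.
Sample = Dict[str, Optional[List[float]]]


def mask_present_modalities(
    sample: Sample,
    mask_ratio: float = 0.5,
    mask_value: float = 0.0,
    seed: Optional[int] = None,
) -> Sample:
    """Return a masked copy of `sample`; missing modalities are left as None."""
    rng = random.Random(seed)
    masked: Sample = {}
    for name, features in sample.items():
        if features is None:
            # Non-parallel sample: this modality is absent, so nothing to mask.
            masked[name] = None
            continue
        # Pick a random subset of positions to replace with the mask value.
        n_mask = int(mask_ratio * len(features))
        masked_positions = set(rng.sample(range(len(features)), n_mask))
        masked[name] = [
            mask_value if i in masked_positions else x
            for i, x in enumerate(features)
        ]
    return masked


# A sample with no image: only the text and tabular features get masked.
example: Sample = {"text": [1.0, 2.0, 3.0, 4.0], "image": None, "tabular": [5.0, 6.0]}
masked_example = mask_present_modalities(example, mask_ratio=0.5, seed=0)
```

In a real model the masked positions would typically be replaced by a learned mask token and the encoders would be trained to recover or match the unmasked content; the constant used here only keeps the sketch self-contained.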

Paper Structure (29 sections, 3 equations, 2 figures, 7 tables, 1 algorithm)

This paper contains 29 sections, 3 equations, 2 figures, 7 tables, 1 algorithm.

Figures (2)

  • Figure 1: LANISTR architecture and pretraining objectives. It is composed of modality-specific encoders and a multimodal fusion encoder that combines the concatenated embeddings via cross attention. LANISTR accepts both parallel (with all modalities present) and non-parallel (data with missing modalities) multimodal data samples.
  • Figure 2: Illustration of similarity-based multimodal masking in LANISTR based on the objective defined between the multimodal input and its masked version.
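The similarity-based objective described for Figure 2 can be sketched as a negative cosine similarity between the fused embedding of a multimodal input and the fused embedding of its masked version. This is a simplified illustration under stated assumptions, not the paper's exact loss: the function names are hypothetical, the embeddings are plain lists of floats, and details a full implementation may include (such as projection heads or gradient stopping) are omitted because this section does not specify them.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_masking_loss(
    z_full: Sequence[float], z_masked: Sequence[float]
) -> float:
    """Negative cosine similarity between the embedding of the unmasked input
    (z_full) and the embedding of its masked version (z_masked).

    Minimizing this pulls the two embeddings toward the same direction.
    """
    return -cosine_similarity(z_full, z_masked)


# Hypothetical fused embeddings of an input and of its masked version.
loss_aligned = similarity_masking_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
loss_orthogonal = similarity_masking_loss([1.0, 0.0], [0.0, 1.0])
```

Under this formulation the loss lies in [-1, 1]: it reaches its minimum when the masked and unmasked embeddings point in the same direction, and it needs no labels, which is why an objective of this kind suits pretraining on unlabeled data.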