A Modular End-to-End Multimodal Learning Method for Structured and Unstructured Data

Marco D'Alessandro; Enrique Calabrés; Mikel Elkano

TL;DR

This work proposes a modular, end-to-end multimodal learning method called MAGNUM, which can natively handle both structured and unstructured data and is flexible enough to employ any specialized unimodal module to extract, compress, and fuse information from all available modalities.

Abstract

Multimodal learning is a rapidly growing research field that has revolutionized multitasking and generative modeling in AI. While much of the research has focused on unstructured data (e.g., language, images, audio, or video), structured data (e.g., tabular data, time series, or signals) has received less attention. However, many industry-relevant use cases involve, or can benefit from, both types of data. In this work, we propose a modular, end-to-end multimodal learning method called MAGNUM, which can natively handle both structured and unstructured data. MAGNUM is flexible enough to employ any specialized unimodal module to extract, compress, and fuse information from all available modalities.

Paper Structure (13 sections, 6 equations, 1 figure, 1 table)

Introduction
Related Work
Parameter-Efficient Learning
Graph Neural Networks
MAGNUM: A Modality-Agnostic Multimodal Modular Architecture
Model architecture
Training objectives
Fine-tuning MAGNUM
Experiments
Evaluation Benchmarks
Implementation Details
Model comparison
Conclusions

Figures (1)

Figure 1: A schematic view depicting the MAGNUM end-to-end pipeline. In the low-level module, unstructured data (i.e. text, image) are processed through transformer encoders, and structured data (i.e. tabular) through feature-level FFNs. In both cases, a set of hidden states is obtained for every modality. In the mid-level module, the hidden states go through three GNN-based steps in order to obtain a smaller set of hidden states. These are processed by a Multimodal Gated Fusion (MGF) layer in the high-level module. A final output hidden state is obtained through aggregation.
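
The data flow in Figure 1 can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the per-modality hidden states are random stand-ins for encoder and FFN outputs, the hypothetical `compress` function replaces the three GNN-based steps with simple chunk averaging, and the hypothetical `gated_fusion` function uses a scalar sigmoid gate per hidden state as one plausible reading of a gated fusion layer followed by aggregation. All names, sizes, and the gating formula are assumptions made for illustration.

```python
import math
import random


def compress(states, k):
    """Stand-in for the mid-level module: reduce a set of hidden states to k
    states by averaging contiguous chunks (NOT the paper's GNN-based steps)."""
    n = len(states)
    if n < k:
        raise ValueError("need at least k hidden states to compress to k")
    chunks = [states[i * n // k:(i + 1) * n // k] for i in range(k)]
    return [[sum(col) / len(chunk) for col in zip(*chunk)] for chunk in chunks]


def gated_fusion(states, w_gate):
    """Toy gated fusion plus aggregation (assumed form): each hidden state gets
    a scalar gate in (0, 1) from a sigmoid over a linear score, and the output
    is the gate-weighted average of all hidden states."""
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(w_gate, h))))
        for h in states
    ]
    total = sum(gates)
    dim = len(states[0])
    return [sum(g * h[d] for g, h in zip(gates, states)) / total for d in range(dim)]


def forward(modalities, k, w_gate):
    """Compress every modality to k hidden states, then fuse all of them
    into a single output hidden state."""
    reduced = [h for states in modalities.values() for h in compress(states, k)]
    return gated_fusion(reduced, w_gate)


# Random stand-ins for the low-level module's outputs (hypothetical sizes).
rng = random.Random(0)
dim = 8
modalities = {
    "text": [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(16)],
    "image": [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(10)],
    "tabular": [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(6)],
}
w_gate = [rng.gauss(0, 1) for _ in range(dim)]  # untrained gate weights

output = forward(modalities, k=4, w_gate=w_gate)
print(len(output))  # one fused hidden state of size dim
```

The point of the sketch is the shape of the pipeline: each modality contributes a variable-sized set of hidden states, the middle stage reduces every set to the same small size, and the final stage weighs and aggregates them into one vector. Any of the three stages can be swapped for a stronger module without changing the interfaces between them.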
