A Modular End-to-End Multimodal Learning Method for Structured and Unstructured Data

Marco D'Alessandro, Enrique Calabrés, Mikel Elkano

TL;DR

This work proposes a modular, end-to-end multimodal learning method called MAGNUM, which can natively handle both structured and unstructured data and is flexible enough to employ any specialized unimodal module to extract, compress, and fuse information from all available modalities.

Abstract

Multimodal learning is a rapidly growing research field that has revolutionized multitasking and generative modeling in AI. While much of the research has focused on dealing with unstructured data (e.g., language, images, audio, or video), structured data (e.g., tabular data, time series, or signals) has received less attention. However, many industry-relevant use cases involve or can benefit from both types of data. In this work, we propose a modular, end-to-end multimodal learning method called MAGNUM, which can natively handle both structured and unstructured data. MAGNUM is flexible enough to employ any specialized unimodal module to extract, compress, and fuse information from all available modalities.

Paper Structure

This paper contains 13 sections, 6 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: A schematic view of the MAGNUM end-to-end pipeline. In the low-level module, unstructured data (i.e., text and images) are processed through transformer encoders, and structured data (i.e., tabular data) through feature-level FFNs. In both cases, a set of hidden states is obtained for every modality. In the mid-level module, the hidden states go through three GNN-based steps to obtain a smaller set of hidden states. These are processed by a Multimodal Gated Fusion (MGF) layer in the high-level module. A final output hidden state is obtained through aggregation.
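
The three-stage data flow described in the Figure 1 caption can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: the function names (`encode`, `compress`, `gated_fusion`) are hypothetical, random vectors stand in for the transformer and FFN encoders, contiguous mean-pooling stands in for the three GNN-based steps, and a generic scalar sigmoid gate stands in for the MGF layer, whose actual formulation is not given here. Only the shapes and the order of the stages follow the caption.

```python
import math
import random

random.seed(0)


def encode(n_states, dim):
    """Low-level stand-in: returns n_states random hidden states of size dim.
    (Assumption: replaces the transformer encoders / feature-level FFNs.)"""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_states)]


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def compress(states, k):
    """Mid-level stand-in: mean-pool contiguous chunks down to k hidden states.
    (Assumption: replaces the three GNN-based steps; requires len(states) >= k.)"""
    n = len(states)
    bounds = [i * n // k for i in range(k + 1)]
    return [mean_vec(states[bounds[i]:bounds[i + 1]]) for i in range(k)]


def gated_fusion(modalities, gate_weights):
    """High-level stand-in: a generic scalar gate per modality, then a sum.
    (Assumption: not the MGF layer's actual equations.)"""
    pooled = [mean_vec(states) for states in modalities]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(gate_weights, vec))))
        for vec in pooled
    ]
    # Aggregate the gated modality vectors into one output hidden state.
    fused = [sum(g * vec[j] for g, vec in zip(gates, pooled))
             for j in range(len(gate_weights))]
    return fused, gates


dim, k = 8, 4
text = encode(16, dim)     # e.g. one hidden state per token
image = encode(49, dim)    # e.g. one hidden state per image patch
tabular = encode(10, dim)  # e.g. one hidden state per tabular feature

compressed = [compress(h, k) for h in (text, image, tabular)]
gate_weights = [random.gauss(0.0, 0.1) for _ in range(dim)]
fused, gates = gated_fusion(compressed, gate_weights)
```

After running the sketch, each modality has been reduced from its original number of hidden states (16, 49, and 10) to `k = 4`, and `fused` is a single vector of size `dim`, mirroring the final output hidden state in the figure.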