Multimodal Learning with Transformers: A Survey

Peng Xu, Xiatian Zhu, David A. Clifton

TL;DR

This survey provides a comprehensive overview of Transformer-based multimodal learning, unifying theories and architectures across modalities. It introduces a geometrical/topological viewpoint on self-attention, and proposes a two-dimensional taxonomy by application and design challenges. The paper analyzes vanilla, vision, and multimodal transformers, surveys pretraining paradigms and downstream tasks, and discusses critical issues such as fusion, alignment, efficiency, and transferability. It also outlines open problems and potential directions toward universal, interpretable, and scalable multimodal models with strong cross-modal reasoning and zero-shot capabilities.

Abstract

Transformer is a promising neural network learner, and has achieved great success in various machine learning tasks. Thanks to the recent prevalence of multimodal applications and big data, Transformer-based multimodal learning has become a hot topic in AI research. This paper presents a comprehensive survey of Transformer techniques oriented at multimodal data. The main contents of this survey include: (1) a background of multimodal learning, Transformer ecosystem, and the multimodal big data era, (2) a theoretical review of Vanilla Transformer, Vision Transformer, and multimodal Transformers, from a geometrically topological perspective, (3) a review of multimodal Transformer applications, via two important paradigms, i.e., for multimodal pretraining and for specific multimodal tasks, (4) a summary of the common challenges and designs shared by the multimodal Transformer models and applications, and (5) a discussion of open problems and potential research directions for the community.

Paper Structure

This paper contains 30 sections, 13 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Overview of Transformer (Vaswani et al., 2017).
  • Figure 2: Transformer-based cross-modal interactions: (a) Early Summation, (b) Early Concatenation, (c) Hierarchical Attention (multi-stream to one-stream), (d) Hierarchical Attention (one-stream to multi-stream), (e) Cross-Attention, and (f) Cross-Attention to Concatenation. "Q": Query embedding; "K": Key embedding; "V": Value embedding. "TL": Transformer Layer. Best viewed in colour.
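
To make three of the interaction patterns named in Figure 2 concrete, here is a minimal sketch of Early Summation, Early Concatenation, and Cross-Attention on toy token sequences. It is an illustration under stated assumptions, not the survey's implementation: the function names, the weights `alpha` and `beta`, and the use of single-head attention without learned Q/K/V projections are all simplifications introduced here, and the Transformer Layer ("TL") that would consume each fused result is omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention over lists of vectors.

    Assumption: no learned projections; Q, K, V are used as given.
    Returns one output vector per query.
    """
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def early_summation(Za, Zb, alpha=0.5, beta=0.5):
    """Token-wise weighted sum; both streams need the same number of tokens."""
    if len(Za) != len(Zb):
        raise ValueError("early summation needs position-aligned sequences")
    return [[alpha * a + beta * b for a, b in zip(ta, tb)]
            for ta, tb in zip(Za, Zb)]

def early_concatenation(Za, Zb):
    """Join the two token sequences into one longer sequence."""
    return Za + Zb

def cross_attention(Za, Zb):
    """Each stream's tokens act as queries over the other stream's keys/values."""
    a_out = attention(Za, Zb, Zb)  # A queries, B keys and values
    b_out = attention(Zb, Za, Za)  # B queries, A keys and values
    return a_out, b_out

# Toy inputs (hypothetical): 2 "image" tokens and 3 "text" tokens, dimension 2.
Za = [[1.0, 0.0], [0.0, 1.0]]
Zb = [[1.0, 1.0], [0.5, 0.0], [0.0, 2.0]]

fused_sum = early_summation(Za, Zb[:2])    # same length required
fused_cat = early_concatenation(Za, Zb)    # 5 tokens in one stream
a_out, b_out = cross_attention(Za, Zb)     # two streams, lengths 2 and 3
```

The sketch shows the practical trade-off between the patterns: summation keeps the sequence length fixed but needs position-aligned inputs, concatenation lets one stream see all tokens at the cost of a longer sequence, and cross-attention keeps the two streams separate while each conditions on the other.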