A survey on knowledge-enhanced multimodal learning

Maria Lymperaiou; Giorgos Stamou

TL;DR

This survey addresses the integration of external knowledge with visiolinguistic (VL) learning, outlining how knowledge graphs and other sources fill commonsense, factual, and temporal gaps in VL models. It provides a comprehensive taxonomy of knowledge senses, knowledge sources, graph representations, and a spectrum of knowledge-enhanced VL tasks, from VQA and VCR to image captioning and story generation, detailing architectures, datasets, and evaluation metrics. The authors highlight the predominance of transformers and pretraining in VL, discuss explicit versus implicit knowledge, and analyze challenges such as explainability, data quality, and scalability, while outlining future directions including multi-task learning, LM-as-KB, and richer knowledge senses. Overall, the work clarifies how knowledge integration can improve generalization, interpretability, and robustness in VL systems and guides future research across single-task and multi-task knowledge-enhanced models.

Abstract

Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation. Especially in the area of visiolinguistic (VL) learning, multiple models and techniques have been developed, targeting a variety of tasks that involve images and text. VL models have reached unprecedented performance by extending the idea of Transformers, so that both modalities can learn from each other. Massive pre-training procedures enable VL models to acquire a certain level of real-world understanding, although many gaps can be identified: the limited comprehension of commonsense, factual, temporal and other everyday knowledge aspects calls into question the extensibility of VL tasks. Knowledge graphs and other knowledge sources can fill those gaps by explicitly providing missing information, unlocking novel capabilities of VL models. At the same time, knowledge graphs enhance explainability, fairness and validity of decision making, issues of utmost importance for such complex implementations. The current survey aims to unify the fields of VL representation learning and knowledge graphs, and provides a taxonomy and analysis of knowledge-enhanced VL models.

Paper Structure (136 sections, 6 equations, 5 figures)

Introduction
Background
Multimodal representation learning
Text representation
Distributed word representations
Recurrent neural networks (RNNs)
Language transformers
Visual representation
Convolutional Neural Networks (CNNs)
Image Transformers
Sequential models for VL tasks
Multimodal Transformers
Special input tokens and embeddings
Vision and Language joint encoding
Double-stream fusion encoder
...and 121 more sections

Figures (5)

Figure 1: A generic overview of KVL implementations. Knowledge can be fused early in the pipeline, contributing to a KVL representation, or later on, modifying the outcome of a VL model.
Figure 2: The overall workflow of a VL transformer.
Figure 3: A general outline of input tokens and embeddings.
Figure 4: Overview of knowledge sources.
Figure 5: A taxonomy of VL tasks with knowledge.
