Multimodal Alignment and Fusion: A Survey

Songtao Li; Hao Tang

TL;DR

This survey tackles the problem of aligning and fusing information across text, images, audio, and video. It introduces a structure-centric, method-driven framework that pairs data-level, feature-level, and output-level fusion with statistical, kernel-based, graphical, generative, contrastive, attention-based, and LLM-based approaches. Drawing on over 260 studies, it identifies key challenges (cross-modal misalignment, modality gaps, data quality, and scalability) and discusses practical applications such as retrieval, emotion recognition, and embodied AI. The analysis aims to guide future research toward scalable, robust, and generalizable multimodal alignment and fusion across diverse domains.

Abstract

This survey provides a comprehensive overview of recent advances in multimodal alignment and fusion within machine learning, driven by the increasing availability and diversity of data modalities such as text, images, audio, and video. Unlike previous surveys, which often focus on specific modalities or a limited set of fusion strategies, our work presents a structure-centric and method-driven framework that emphasizes generalizable techniques. We systematically categorize and analyze key approaches to alignment and fusion along two axes: structural perspectives, namely data-level, feature-level, and output-level fusion; and methodological paradigms, including statistical, kernel-based, graphical, generative, contrastive, attention-based, and large language model (LLM)-based methods. The analysis draws on an extensive review of over 260 relevant studies. Furthermore, this survey highlights critical challenges such as cross-modal misalignment, computational bottlenecks, data quality issues, and the modality gap, along with recent efforts to address them. Applications ranging from social media analysis and medical imaging to emotion recognition and embodied AI are explored to illustrate the real-world impact of robust multimodal systems. The insights provided aim to guide future research toward optimizing multimodal learning systems for improved scalability, robustness, and generalizability across diverse domains.
