Multimodal Alignment and Fusion: A Survey

Songtao Li, Hao Tang

TL;DR

This survey tackles the problem of aligning and fusing information across text, images, audio, and video. It introduces a structure-centric, method-driven framework that pairs data-level, feature-level, and output-level fusion with statistical, kernel-based, graphical, generative, contrastive, attention-based, and LLM-based approaches. Drawing on over 260 studies, it identifies key challenges (cross-modal misalignment, modality gaps, data quality, and scalability) and discusses practical applications such as retrieval, emotion recognition, and embodied AI. The analysis aims to guide future research toward scalable, robust, and generalizable multimodal alignment and fusion across diverse domains.

Abstract

This survey provides a comprehensive overview of recent advances in multimodal alignment and fusion within the field of machine learning, driven by the increasing availability and diversity of data modalities such as text, images, audio, and video. Unlike previous surveys that often focus on specific modalities or limited fusion strategies, our work presents a structure-centric and method-driven framework that emphasizes generalizable techniques. We systematically categorize and analyze key approaches to alignment and fusion through both structural perspectives (data-level, feature-level, and output-level fusion) and methodological paradigms, including statistical, kernel-based, graphical, generative, contrastive, attention-based, and large language model (LLM)-based methods. The analysis draws on an extensive review of over 260 relevant studies. Furthermore, this survey highlights critical challenges such as cross-modal misalignment, computational bottlenecks, data quality issues, and the modality gap, along with recent efforts to address them. Applications ranging from social media analysis and medical imaging to emotion recognition and embodied AI are explored to illustrate the real-world impact of robust multimodal systems. The insights provided aim to guide future research toward optimizing multimodal learning systems for improved scalability, robustness, and generalizability across diverse domains.

Paper Structure

This paper contains 31 sections, 7 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Overview of multimodal model architectures: (a) Two-Tower [radford_clip_2021, jia2021ALIGN, liang2024multimodal, vasilakis2024instrument, xu2023bridgetower, su2023beyond, du2023touchformer, chen2024mixtower, fei2022towards, wen2024multimodal, tu2022crossmodal, yuan2021medication]: processes images and text separately, combining embeddings through simple operations (addition, multiplication, dot product, and concatenation); (b) Two-Leg [Allaire2012FusingIF, Badrinarayanan2015SegNetAD, danapal2020sensofusion, Guo2023AMF, Jaiswal2015LearningTC, Li2020HierarchicalFF, Li2018DenseFuseAF, Mai2019DivideCA, Makris2011AHF, Missaoui2010ModelLF, Rvid2019TowardsRS, Steinbaeck2018DesignOA, Uezato2020GuidedDD, Wei2021DecisionLevelDF, kim_vilt_2021]: combines separate image and text embeddings using a Fusion Network; (c) One-Tower [bao_vlmo_nodate, li_blip_2022, Li2023BLIP2, chen2023instructblip, chen2023instructblip2, zhu2023minigpt4enhancingvisionlanguageunderstanding, wang2022simvlm, wang2024qwen2vlenhancingvisionlanguagemodels, bai2023qwenvl, alayrac_flamingo_2022]: utilizes a unified network to jointly embed image and text inputs.
  • Figure 2: Three types of fusion from a structural perspective: (1) Data-level Fusion: directly combines raw data from multiple modalities; (2) Feature-level Fusion: integrates encoded features from each modality; (3) Output-level Fusion: fuses outputs from individual modality decoders to produce a final result.
  • Figure 3: Canonical Correlation Analysis (CCA), a classic alignment method, aligns different sample matrices with varying feature dimensions using a shared weight matrix to produce a unified representation. $X$, $Y$, and $Z$ are the data matrices from three different spaces.
  • Figure 4: In graph-based alignment, different data modalities can form graphs with distinct meanings, where the interpretation of edges and nodes may vary. For example, in [Kolar2012GraphAlignment], the interpretation of vertices and edges depends on the type of biological network being compared.
  • Figure 5: Illustration from [Chen2020HGMFHG], demonstrating how graph models can effectively fuse modalities, even when some data are missing.
  • ...and 3 more figures
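
As a rough illustration of the three structural fusion levels described in the Figure 2 caption, and of the simple embedding operations named in the Figure 1 caption, the following is a minimal NumPy sketch. It is not taken from the survey: the function names, the `mode` and `weight_a` parameters, and the weighted-average rule used for output-level fusion are all assumptions chosen for illustration.

```python
import numpy as np


def data_level_fusion(raw_a, raw_b):
    """Data-level fusion: join raw inputs before any encoding.

    Assumption: both inputs are numeric arrays that can be flattened
    and concatenated into a single vector.
    """
    return np.concatenate([np.ravel(raw_a), np.ravel(raw_b)])


def feature_level_fusion(feat_a, feat_b, mode="concat"):
    """Feature-level fusion: combine already-encoded feature vectors.

    The four modes mirror the simple operations listed for two-tower
    models: concatenation, addition, multiplication, and dot product.
    All modes except "concat" assume equal-length vectors.
    """
    if mode == "concat":
        return np.concatenate([feat_a, feat_b])
    if mode == "add":
        return feat_a + feat_b
    if mode == "multiply":
        return feat_a * feat_b
    if mode == "dot":
        return float(np.dot(feat_a, feat_b))
    raise ValueError(f"unknown fusion mode: {mode}")


def output_level_fusion(probs_a, probs_b, weight_a=0.5):
    """Output-level fusion: merge per-modality predictions.

    Assumption: each input is a probability vector over the same
    classes; a weighted average is one of several possible rules.
    """
    fused = weight_a * probs_a + (1.0 - weight_a) * probs_b
    return fused / fused.sum()
```

In this sketch the three functions differ only in what they receive: raw inputs, encoded features, or per-modality predictions. That is the distinction the structural perspective draws; real systems typically replace each function with a learned component.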