Multi-modal Generative AI: Multi-modal LLMs, Diffusions, and the Unification

Xin Wang; Yuwei Zhou; Bin Huang; Hong Chen; Wenwu Zhu

TL;DR

This survey addresses the problem of unifying understanding and generation in multi-modal AI. It systematically reviews the two dominant paradigms, multi-modal LLMs for understanding and diffusion models for generation, and then discusses design choices for unification, including probabilistic modeling (autoregressive vs. diffusion) and architecture (dense vs. MoE, single vs. dual encoders). The work provides a structured taxonomy of unified models, analyzes their practical trade-offs, and summarizes widely used datasets and future research directions such as video unification, graph-based multi-modal generation, and lightweight deployment. The findings offer a foundation for developing more powerful, efficient, and generalizable multi-modal generative AI capable of both understanding and generating across images, video, and speech.

Abstract

Multi-modal generative AI (Artificial Intelligence) has attracted increasing attention from both academia and industry. In particular, two dominant families of techniques have emerged: i) multi-modal large language models (LLMs), which demonstrate impressive ability for multi-modal understanding; and ii) diffusion models, which exhibit remarkable capability for multi-modal generation. This paper therefore provides a comprehensive overview of multi-modal generative AI, covering multi-modal LLMs, diffusion models, and their unification for understanding and generation. To lay a solid foundation for unified models, we first review multi-modal LLMs and diffusion models in detail, including their probabilistic modeling procedures, multi-modal architecture designs, and advanced applications to image/video LLMs as well as text-to-image/video generation. We then explore emerging efforts toward unified models for understanding and generation, investigating key designs including autoregressive-based and diffusion-based modeling, as well as dense and Mixture-of-Experts (MoE) architectures. We introduce several strategies for unified models and analyze their potential advantages and disadvantages. In addition, we summarize the datasets widely used for multi-modal generative AI pretraining. Finally, we present several challenging future research directions that may contribute to the ongoing advancement of multi-modal generative AI.

Paper Structure (48 sections, 6 equations, 10 figures, 3 tables)

Introduction
Multi-modal LLM for Understanding
Preliminaries
LLM Autoregressive Probabilistic Modeling
Vision-Language Pretraining
Visual Tokenizer
Multi-modal LLM Architectures
Image LLM
Alignment-Architecture Image LLM
Early-fusion Architecture Image LLM
Challenges in Image LLM
Video LLM
Alignment-Architecture Video LLM
Challenges and Limitations in Video LLM
Speech LLM
...and 33 more sections

Figures (10)

Figure 1: The overall organization of this paper.
Figure 2: Illustration for the framework of the visual tokenizers.
Figure 3: Two branches of multi-modal LLM architectures: (i) the alignment architecture, which aligns pretrained vision models with an LLM, and (ii) the early-fusion architecture, which receives mixed visual and text tokens and relies on autoregressive modeling for multi-modal understanding.
Figure 4: Comparison among GAN, VAE, diffusion, and flow matching models.
Figure 5: Comparison between pixel-level and latent diffusion models.
...and 5 more figures
