Multi-modal Generative AI: Multi-modal LLMs, Diffusions, and the Unification
Xin Wang, Yuwei Zhou, Bin Huang, Hong Chen, Wenwu Zhu
TL;DR
This survey addresses the problem of unifying understanding and generation in multi-modal AI. It systematically reviews two dominant paradigms, multi-modal LLMs for understanding and diffusion models for generation, and then discusses design choices for unification, including probabilistic modeling (autoregressive vs. diffusion) and architecture (dense vs. MoE, single vs. dual encoders). The work provides a structured taxonomy of unified models, analyzes practical trade-offs, and summarizes widely used datasets and future research directions such as video unification, graph-based multi-modal generation, and lightweight deployment. The findings offer a foundation for developing more powerful, efficient, and generalizable multi-modal generative AI capable of both understanding and generating across images, video, and speech.
Abstract
Multi-modal generative AI (Artificial Intelligence) has attracted increasing attention from both academia and industry. In particular, two dominant families of techniques have emerged: i) multi-modal large language models (LLMs), which demonstrate impressive ability for multi-modal understanding; and ii) diffusion models, which exhibit remarkable capability for multi-modal generation. This paper provides a comprehensive overview of multi-modal generative AI, covering multi-modal LLMs, diffusion models, and the unification of understanding and generation. To lay a solid foundation for unified models, we first provide a detailed review of multi-modal LLMs and diffusion models, including their probabilistic modeling procedures, multi-modal architecture designs, and advanced applications to image/video LLMs as well as text-to-image/video generation. We then explore the emerging efforts toward unified models for understanding and generation. To achieve this unification, we investigate key design choices, including autoregressive-based and diffusion-based modeling, as well as dense and Mixture-of-Experts (MoE) architectures. We further introduce several strategies for building unified models and analyze their potential advantages and disadvantages. In addition, we summarize the datasets widely used for multi-modal generative AI pretraining. Finally, we present several challenging future research directions that may contribute to the ongoing advancement of multi-modal generative AI.
