LLMs Meet Multimodal Generation and Editing: A Survey

Yingqing He, Zhaoyang Liu, Jingye Chen, Zeyue Tian, Hongyu Liu, Xiaowei Chi, Runtao Liu, Ruibin Yuan, Yazhou Xing, Wenhai Wang, Jifeng Dai, Yong Zhang, Wei Xue, Qifeng Liu, Yike Guo, Qifeng Chen

TL;DR

The paper surveys how large language models enable generation and editing across image, video, 3D, and audio, detailing the roles of LLMs, CLIP/T5-based approaches, and tool-augmented multimodal agents. It provides a structured comparison of generative model families, multimodal alignment, and MLLMs, then examines modality-specific generation and editing techniques, datasets, and safety considerations. It highlights the rise of multimodal agents and tool-based workflows, and discusses safety, applications, and future directions toward unified, high-fidelity world models. The work aims to guide researchers and practitioners in building scalable, interactive, and safe generative systems that combine language with vision and audio.

Abstract

With the recent advancement in large language models (LLMs), there is a growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal large language models (MLLMs) mainly focus on multimodal understanding. This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio. Specifically, we summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods. Then, we summarize the various roles of LLMs in multimodal generation and exhaustively investigate the critical technical components behind these methods and the multimodal datasets utilized in these studies. Additionally, we dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction. Lastly, we discuss the advancements in the generative AI safety field, investigate emerging applications, and discuss future prospects. Our work provides a systematic and insightful overview of multimodal generation and processing, which is expected to advance the development of Artificial Intelligence for Generative Content (AIGC) and world models. A curated list of all related papers can be found at https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation

Paper Structure (86 sections, 6 equations, 16 figures, 12 tables)

Figures (16)

  • Figure 1: Our main goal is to investigate the roles of LLMs in language-guided multimodal generation. The modalities we focus on are image, video, 3D, and audio (including sound, music, and speech).
  • Figure 2: Illustration of generative models. In this figure, $x$ and $x_0$ denote samples from the real data distribution, $x'$ denotes a sample from the model's estimated data distribution, and $z$ denotes the latent sampled from a prior distribution (typically a Gaussian distribution).
  • Figure 3: A historical review of the development of image generation. Early works on image generation predominantly concentrated on synthesizing images within specific narrow domains, such as human faces or bedrooms [yu2015lsun, liu2018large]. Subsequently, DALL-E [dalle] and Latent Diffusion Models (LDM) [ldm] progressed to generating images from user prompts and supporting the synthesis of open-domain images. In the past two years, powered by LLMs, research has trended toward a more intuitive and interactive image generation process, such as iterative generation through conversations [dong2023dreamllm, dalle3].
  • Figure 4: A generic pipeline for integrating image comprehension and generation abilities into LLMs [sun2023emu2, ge2023planting, dong2023dreamllm, ge2023making]. At inference time, users can input interleaved multimodal data (e.g., text and images). The image tokenizer converts the images into image tokens and feeds them into the LLM. The LLM outputs tokens, which are then decoded into textual responses and images.
  • Figure 5: Pipeline comparisons of (a) standard text-to-image (T2I) [saharia2022photorealistic, ldm], (b) T2I with LLMs as layout planners [feng2023layoutgpt, qu2023layoutllm, chen2023textdiffuser2, lian2023llm, cho2023visual, zhang2023controllable, gani2023llm], and (c) T2I with LLMs for layout suggestions [jia2023cole, wu2023self].
  • ...and 11 more figures