AI-Generated Content (AIGC) for Various Data Modalities: A Survey

Lin Geng Foo, Hossein Rahmani, Jun Liu

TL;DR

This survey comprehensively maps AI-generated content across text, image, video, 3D, and audio modalities, foregrounding machine learning and diffusion-based methods while detailing cross-modality generation (e.g., text-to-image, text-to-3D, text-to-video). It introduces a modality-centric taxonomy that separates single-modality AIGC into unconditional and conditional generation, and then organizes cross-modality work by output modality and conditioning input. The paper catalogs representative datasets, benchmarks, and trends, and it discusses core challenges, applications, and future directions, including data availability, privacy, and IP concerns. By presenting standardized comparisons and a unified framework, it aims to guide future research and support practitioners deploying multi-modal AIGC systems with awareness of their capabilities and limitations.

Abstract

AI-generated content (AIGC) methods aim to produce text, images, videos, 3D assets, and other media using AI algorithms. Due to its wide range of applications and the potential of recent works, AIGC developments -- especially in Machine Learning (ML) and Deep Learning (DL) -- have been attracting significant attention, and this survey focuses on comprehensively reviewing such advancements in ML/DL. AIGC methods have been developed for various data modalities, such as image, video, text, 3D shape, 3D scene, 3D human avatar, 3D motion, and audio -- each presenting unique characteristics and challenges. Furthermore, there have been significant developments in cross-modality AIGC methods, where generative methods receive conditioning input in one modality and produce outputs in another. Examples include going from various modalities to image, video, 3D, and audio. This paper provides a comprehensive review of AIGC methods across different data modalities, including both single-modality and cross-modality methods, highlighting the various challenges, representative works, and recent technical directions in each setting. We also survey the representative datasets throughout the modalities, and present comparative results for various modalities. Moreover, we discuss the typical applications of AIGC methods in various domains, challenges, and future research directions.

Paper Structure (62 sections, 10 figures, 10 tables)

This paper contains 62 sections, 10 figures, 10 tables.

Figures (10)

  • Figure 1: General trend of the number of papers published regarding a) single-modality and b) cross-modality generation/editing every year over the past five years in six top CV and ML conferences (CVPR, ICCV, ECCV, NeurIPS, ICML, and ICLR), as well as three related AI conferences (AAAI, IJCAI, ECAI). a) There is an increasing trend over the years for AIGC papers published regarding image, video and 3D (shape, human and scene) generation for a single modality. b) There is an observable spike over the last 2 years for papers published regarding cross-modality (text-to-image, text-to-video and text-to-3D) generation.
  • Figure 2: Taxonomy of single-modality AIGC methods in this survey. We organize the taxonomy according to the various generated modalities at the top level (in yellow). Then, each modality is further split into unconditional generation methods and conditional generation methods (in green). Specifically, the unconditional methods we discuss are often the fundamental techniques and architectures (in grey) for generating each modality. For the discussion of certain modalities' unconditional methods (e.g., 3D modalities), we further categorize them according to the different representations or settings of the modality (in blue), before going in-depth into their respective fundamental techniques (in grey) for each representation or setting. To save space, we label them as "Fundamental" in the figure to represent "Fundamental Developments (Techniques and Architectures)". Furthermore, when discussing the conditional methods of each modality, we categorize them according to the different conditioning scenarios and settings (in orange).
  • Figure 3: Taxonomy of cross-modality AIGC methods. At the top level, we organize the taxonomy according to the various generated modalities (i.e., output modalities). Then, for each modality (in yellow), we categorize the works according to the modality of the input conditioning information (in green). Moreover, we further categorize them according to the different conditioning settings (in orange) and representations (in blue).
  • Figure 4: Visualization of the trend of improvement of image generation metrics (FID and NLL) on the CIFAR-10 dataset. Each point corresponds to a result reported in Table \ref{table:image_methods}, and the points are color-coded according to the type of method.
  • Figure 5: Illustration of various conditional image generation settings. Examples obtained from ho2021classifier, yu2018generative, saharia2022palette, and pan2023draggan.
  • ...and 5 more figures