Controllable Generation with Text-to-Image Diffusion Models: A Survey

Pu Cao; Feng Zhou; Qing Song; Lu Yang

TL;DR

This survey reviews controllable generation with text-to-image diffusion models, covering the theoretical mechanisms for injecting novel conditions into the diffusion denoising process. It introduces a condition-centric taxonomy that divides methods into single-condition, multi-condition, and universal approaches, and analyzes two control mechanisms in depth: conditional score prediction and condition-guided score estimation. For single conditions, it covers personalization, spatial control, advanced text conditioning, in-context generation, brain and sound guidance, and text rendering. For multi-condition setups, it discusses joint training, continual learning, weight fusion, attention integration, and guidance composition. The survey also highlights practical applications in image manipulation, inpainting, composition, and text/image-to-3D generation, and identifies challenges and promising directions for the evolving AIGC landscape.

Abstract

In the rapidly advancing realm of visual generation, diffusion models have revolutionized the landscape, marking a significant shift in capabilities with their impressive text-guided generative functions. However, relying solely on text for conditioning these models does not fully cater to the varied and complex requirements of different applications and scenarios. Acknowledging this shortfall, a variety of studies aim to control pre-trained text-to-image (T2I) models to support novel conditions. In this survey, we undertake a thorough review of the literature on controllable generation with T2I diffusion models, covering both the theoretical foundations and practical advancements in this domain. Our review begins with a brief introduction to the basics of denoising diffusion probabilistic models (DDPMs) and widely used T2I diffusion models. We then reveal the controlling mechanisms of diffusion models, theoretically analyzing how novel conditions are introduced into the denoising process for conditional generation. Additionally, we offer a detailed overview of research in this area, organizing it into distinct categories from the condition perspective: generation with specific conditions, generation with multiple conditions, and universal controllable generation. For an exhaustive list of the controllable generation literature surveyed, please refer to our curated repository at https://github.com/PRIV-Creation/Awesome-Controllable-T2I-Diffusion-Models.

Paper Structure (39 sections, 16 equations, 6 figures, 1 table)

Introduction
Preliminaries
Denoising Diffusion Probabilistic Models
Text-to-Image Diffusion Models
Taxonomy
How to Control Text-to-Image Diffusion Models with Novel Conditions
Conditional Score Prediction
Condition-Guided Score Estimation
Controllable Text-to-Image Generation with Specific Conditions
Personalization
Subject-Driven Generation
Person-Driven Generation
Style-Driven Generation
Interaction-Driven Generation
Image-Driven Generation
...and 24 more sections

Figures (6)

Figure 1: An overview of conditional generation with T2I diffusion models. (a) The number of papers on controllable generation based on T2I diffusion models, which has increased rapidly since powerful generators were released. (b) A schematic illustration of controllable generation using a T2I diffusion model, where novel conditions beyond text are introduced to steer the outcomes. Example images are sourced from chen2023photoverse.
Figure 2: Taxonomy of Controllable Generation. From the condition perspective, we categorize controllable generation approaches into three sub-tasks, including generation with specific conditions, generation with multiple conditions, and universal controllable generation.
Figure 3: Illustrations of conditional score prediction mechanisms.
Figure 4: Illustration of condition-guided score estimation.
Figure 5: Illustration of controllable text-to-image generation with specific conditions. The condition is marked with a blue background. Examples are sourced from wei2023elite, chen2023photoverse, chen2023artadapter, huang2023learning, ramesh2022hierarchical, cao2023concept, zhang2023adding, wang2023context, wu2023paragraph, lu2023minddiffuser, qin2023gluegen, zhang2023brush.
...and 1 more figures
