A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models

Xincheng Shuai; Henghui Ding; Xingjun Ma; Rongcheng Tu; Yu-Gang Jiang; Dacheng Tao

TL;DR

This survey analyzes multimodal-guided image editing with text-to-image diffusion models, introducing a unified framework that decouples inversion (preserving source content) from editing (applying user-guided changes). It contrasts training-free and training-based approaches, details four editing paradigms (attention-based, blending-based, score-based, and optimization-based), and maps them onto a two-branch design space. The work surveys hundreds of methods, discusses how 2D techniques extend to video editing and how they address temporal inconsistency, and highlights open challenges in content-aware and content-free editing, data requirements, and cross-domain generalization. By organizing methods around a common inversion-editing framework and providing design-space guidance, the paper offers a practical reference for researchers selecting methods for specific multimodal editing tasks and applications.

Abstract

Image editing aims to modify a given synthetic or real image to meet specific user requirements. It has been widely studied in recent years as a promising and challenging field of Artificial Intelligence Generative Content (AIGC). Recent significant advances in this field build on the development of text-to-image (T2I) diffusion models, which generate images according to text prompts. These models demonstrate remarkable generative capabilities and have become widely used tools for image editing. T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs. In this survey, we provide a comprehensive review of multimodal-guided image editing techniques that leverage T2I diffusion models. First, we define the scope of image editing from a holistic perspective and detail various control signals and editing scenarios. We then propose a unified framework to formalize the editing process, categorizing it into two primary algorithm families. This framework offers a design space for users to achieve specific goals. Subsequently, we present an in-depth analysis of each component within this framework, examining the characteristics and applicable scenarios of different combinations. Because training-based methods learn to map the source image directly to the target image under user guidance, we discuss them separately and introduce the schemes used to inject the source image in different scenarios. Additionally, we review the application of 2D techniques to video editing, highlighting solutions for inter-frame inconsistency. Finally, we discuss open challenges in the field and suggest potential future research directions. We keep tracking related work at https://github.com/xinchengshuai/Awesome-Image-Editing.

Paper Structure (45 sections, 25 equations, 17 figures, 2 tables)

Introduction
Preliminaries
Denoising Diffusion Probabilistic Models
Text-to-Image Generation
Notation
Problem Formulation
Definition of Multimodal-Guided Image Editing
Multimodal User Guidance
Editing Scenario
Evaluation of Image Editing
Unified Framework
Inversion Algorithm
Tuning-Based Inversion
Textual Space
Model Space
...and 30 more sections

Figures (17)

Figure 1: Editing tasks meeting our definition. We categorize editing tasks into content-aware and content-free groups, and enumerate several source-target pairs along with corresponding control signals for each scenario. The sample images are from Inversion-Free, Ledits++, Paint-by-Example, P2P, Cross-Image-Attention, DesignEdit, Drag-Diffusion, InST, ControlNet, DreamBooth, MATTE, and Reversion.
Figure 2: Organization of the survey.
Figure 3: Unified Framework. We present an example of object addition to illustrate how the two algorithm families cooperate within the proposed framework. The inversion algorithm $F_{inv}$ encodes the source image $I_s$ into $\Phi_I$, and the source prompt $\mathcal{C}_I$ identifies the original contents. The editing algorithm $F_{edit}$ employs $\Phi_I$ and the guidance set $G$ to infer the edited image $\mathbf{z}_0^e$.
Figure 4: Application of the unified framework. We represent some studies from different tasks within our framework, such as object / attribute manipulation (DDS, P2P, PnP, MasaCtrl, Imagic, Forgedit, PTI, DAC, SINE, Diff-Edit, Pfb-diff, SEGA, Ledits++, Region-Aware, CDS, PTI), spatial transformation (Drag-Diffusion, Self-Guidance, Dragon-Diffusion), inpainting (Blended-Latent-Diffusion), style change (StyleInjection), and customization (DreamMatcher, Custom-Edit, PhotoSwap, VICO, DCO, Pick-and-Draw).
Figure 5: Attention-Based Editing. The illustrated methods are P2P and MasaCtrl. We use red and green to represent the source and target prompts, respectively. The superscripts $s$ and $t$ denote attention features from the source and edited images. $A(\cdot)$ in (b) indicates the computation of the attention map.
...and 12 more figures
