A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models

Xincheng Shuai, Henghui Ding, Xingjun Ma, Rongcheng Tu, Yu-Gang Jiang, Dacheng Tao

TL;DR

This survey analyzes multimodal-guided image editing with text-to-image diffusion models, introducing a unified framework that decouples inversion (preserving source content) from editing (applying user-guided changes). It contrasts training-free and training-based approaches, detailing four editing paradigms—attention-based, blending-based, score-based, and optimization-based—and maps them onto a two-branch design space. The work surveys hundreds of methods, discusses 2D-to-video extensions to address temporal inconsistency, and highlights open challenges in content-aware/content-free editing, data requirements, and cross-domain generalization. By organizing methods around a common inversion-editing framework and providing design-space guidance, the paper offers a practical reference for researchers selecting methods for specific multimodal editing tasks and applications.

Abstract

Image editing aims to modify a given synthetic or real image to meet specific user requirements. It has been widely studied in recent years as a promising and challenging field of Artificial Intelligence Generative Content (AIGC). Recent significant advances in this field build on the development of text-to-image (T2I) diffusion models, which generate images according to text prompts. These models demonstrate remarkable generative capabilities and have become widely used tools for image editing. T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs. In this survey, we provide a comprehensive review of multimodal-guided image editing techniques that leverage T2I diffusion models. First, we define the scope of image editing from a holistic perspective and detail various control signals and editing scenarios. We then propose a unified framework to formalize the editing process, categorizing it into two primary algorithm families. This framework offers a design space for users to achieve specific goals. Subsequently, we present an in-depth analysis of each component within this framework, examining the characteristics and applicable scenarios of different combinations. Because training-based methods learn to directly map the source image to the target image under user guidance, we discuss them separately and introduce the schemes used to inject the source image in different scenarios. Additionally, we review the application of 2D techniques to video editing, highlighting solutions for inter-frame inconsistency. Finally, we discuss open challenges in the field and suggest potential future research directions. We keep tracking related work at https://github.com/xinchengshuai/Awesome-Image-Editing.

Paper Structure

This paper contains 45 sections, 25 equations, 17 figures, 2 tables.

Figures (17)

  • Figure 1: Editing tasks meeting our definition. We categorize editing tasks into content-aware and content-free groups, and enumerate several source-target pairs along with corresponding control signals for each scenario. The sample images are from Inversion-Free, Ledits++, Paint-by-Example, P2P, Cross-Image-Attention, DesignEdit, Drag-Diffusion, InST, ControlNet, DreamBooth, MATTE, and Reversion.
  • Figure 2: Organization of the survey.
  • Figure 3: Unified Framework. We present an example of object addition to illustrate how the two algorithm families cooperate within the proposed framework. The inversion algorithm $F_{inv}$ encodes the source image $I_s$ into $\Phi_I$, and the source prompt $\mathcal{C}_I$ identifies the original contents. The editing algorithm $F_{edit}$ uses $\Phi_I$ and the guidance set $G$ to infer the edited image $\mathbf{z}_0^e$.
  • Figure 4: Application of unified framework. We represent some studies from different tasks within our framework, like object / attribute manipulation (DDS, P2P, PnP, MasaCtrl, Imagic, Forgedit, PTI, DAC, SINE, Diff-Edit, Pfb-diff, SEGA, Ledits++, Region-Aware, CDS, PTI), spatial transformation (Drag-Diffusion, Self-Guidance, Dragon-Diffusion), inpainting (Blended-Latent-Diffusion), style change (StyleInjection), and customization (DreamMatcher, Custom-Edit, PhotoSwap, VICO, DCO, Pick-and-Draw).
  • Figure 5: Attention-Based Editing. Illustrated methods are P2P and MasaCtrl. We use red and green colors to represent source and target prompts respectively. The superscripts $s$ and $t$ denote attention features from source and editing images. $A(\cdot)$ in (b) indicates the computation of attention map.
  • ...and 12 more figures
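
The two-stage pipeline described in the Figure 3 caption can be written compactly as a composition of the two algorithm families. This is only a sketch inferred from the caption, using its symbols; the caption does not state how the source prompt $\mathcal{C}_I$ enters $F_{inv}$, so it is left out of the argument list here, and the full paper may define the arguments differently.

$$
\Phi_I = F_{inv}(I_s), \qquad \mathbf{z}_0^e = F_{edit}(\Phi_I, G)
\quad\Longrightarrow\quad
\mathbf{z}_0^e = F_{edit}\big(F_{inv}(I_s),\, G\big)
$$

Read this way, $\Phi_I$ is the only channel through which source content reaches the edit, while $G$ carries the user guidance. That separation is what lets an inversion method and an editing method be chosen independently within the design space.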