Textual and Visual Prompt Fusion for Image Editing via Step-Wise Alignment

Zhanbo Feng, Zenan Ling, Xinyu Lu, Ci Gong, Feng Zhou, Wugedele Bao, Jie Li, Fan Yang, Robert C. Qiu

TL;DR

This work tackles the challenge of editing real images with diffusion models by coupling visual references with text guidance in the semantic latent space of a frozen pre-trained diffusion model. The authors introduce Step-Wise Alignment (SWA), a framework that uses four components (Text Encoder, Visual Generator, Attribute Encoder, and Editing Generator) to fuse visual and textual prompts and refine their alignment through an asymmetrical diffusion update guided by a directional CLIP loss. A reconstruction term stabilizes edits, enabling zero-shot attribute manipulation on both in-domain and out-of-domain targets while preserving image quality. On the CelebA-HQ and LSUN datasets, SWA yields high-quality, semantically consistent edits and outperforms prior methods in both realism and controllability, making it suitable for image editing tasks that require precise semantic control.
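To make the objective mentioned above more concrete, here is a minimal sketch of a directional CLIP loss combined with a reconstruction term. It is an illustration under stated assumptions, not the authors' implementation: the embeddings are assumed to be precomputed by a CLIP image and text encoder, the loss takes the commonly used form of one minus the cosine similarity between the image-space and text-space edit directions, the reconstruction term is assumed to be a mean absolute difference, and the names `directional_clip_loss`, `reconstruction_loss`, `total_loss`, `lambda_dir`, and `lambda_rec` are hypothetical.

```python
import math


def _difference(a, b):
    """Element-wise a - b for two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def _cosine(u, v):
    """Cosine similarity between two vectors; rejects zero-norm inputs."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("edit direction has zero norm")
    return dot / (norm_u * norm_v)


def directional_clip_loss(img_src, img_edit, txt_src, txt_tgt):
    """1 - cos(delta_image, delta_text) on precomputed embeddings.

    Assumption: all four arguments are embedding vectors in a shared
    space (for example, outputs of CLIP's image and text encoders).
    """
    delta_image = _difference(img_edit, img_src)
    delta_text = _difference(txt_tgt, txt_src)
    return 1.0 - _cosine(delta_image, delta_text)


def reconstruction_loss(x_src, x_edit):
    """Mean absolute difference; the exact form is an assumption here."""
    return sum(abs(a - b) for a, b in zip(x_src, x_edit)) / len(x_src)


def total_loss(img_src, img_edit, txt_src, txt_tgt, x_src, x_edit,
               lambda_dir=1.0, lambda_rec=1.0):
    """Weighted sum of both terms; the weights are hypothetical."""
    return (lambda_dir * directional_clip_loss(img_src, img_edit, txt_src, txt_tgt)
            + lambda_rec * reconstruction_loss(x_src, x_edit))
```

The directional term is zero when the change in the image embedding points the same way as the change in the text embedding, and it grows as the two directions diverge. The reconstruction term penalizes edits that drift far from the source, which is the stabilizing role described above.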

Abstract

The use of denoising diffusion models is becoming increasingly popular in the field of image editing. However, current approaches often rely on either image-guided methods, which provide a visual reference but lack control over semantic consistency, or text-guided methods, which ensure alignment with the text guidance but compromise visual quality. To resolve this issue, we propose a framework that integrates a fusion of generated visual references and text guidance into the semantic latent space of a frozen pre-trained diffusion model. Using only a tiny neural network, our framework provides control over diverse content and attributes, driven intuitively by the text prompt. Compared to state-of-the-art methods, the framework generates images of higher quality while providing realistic editing effects across various benchmark datasets.

Paper Structure (11 sections, 5 equations, 6 figures, 2 tables, 1 algorithm)

Figures (6)

  • Figure 1: The framework of SWA. The reference image is encoded into features $\Delta h$, which are then integrated into the latent features $h$ of the image being edited. The textual prompt contributes semantic information to the manipulation process.
  • Figure 2: Consistent editing. Once a reference image is provided, our method enables consistent and controllable editing toward that reference. In contrast, NTI fails to generate the corresponding style for the glasses.
  • Figure 3: Editing results for in-domain attributes.
  • Figure 4: Editing results for out-of-domain attributes.
  • Figure 5: Editing results of SWA on various datasets.
  • ...and 1 more figure