Textual and Visual Prompt Fusion for Image Editing via Step-Wise Alignment

Zhanbo Feng; Zenan Ling; Xinyu Lu; Ci Gong; Feng Zhou; Wugedele Bao; Jie Li; Fan Yang; Robert C. Qiu

TL;DR

This work tackles the challenge of editing real images with diffusion models by coupling visual references with text guidance in the semantic latent space of a frozen pre-trained diffusion model. The authors introduce Step-Wise Alignment (SWA), a framework with four components (Text Encoder, Visual Generator, Attribute Encoder, and Editing Generator) that fuses visual and textual prompts and refines their alignment through an asymmetrical diffusion update guided by a directional CLIP loss. A reconstruction term stabilizes edits, enabling zero-shot attribute manipulation on both in-domain and out-of-domain targets while preserving image quality. On the CelebA-HQ and LSUN datasets, SWA yields high-quality, semantically consistent edits and outperforms prior methods in both realism and controllability, which matters for image editing tasks that require precise semantic control.
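The directional CLIP loss mentioned above is commonly written as one minus the cosine similarity between the change in image embedding and the change in text embedding. The sketch below illustrates that common form only; the exact loss used by SWA is not given in this summary. Plain Python lists stand in for CLIP embeddings, and the function and argument names (`directional_clip_loss`, `img_src`, `img_edit`, `txt_src`, `txt_tgt`) are hypothetical.

```python
import math
from typing import Sequence


def _cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("direction vectors must be non-zero")
    return dot / (norm_u * norm_v)


def directional_clip_loss(
    img_src: Sequence[float],
    img_edit: Sequence[float],
    txt_src: Sequence[float],
    txt_tgt: Sequence[float],
) -> float:
    """1 - cos(image direction, text direction).

    Assumption: the inputs are embeddings from a CLIP image encoder and
    a CLIP text encoder; here they are plain lists for illustration.
    """
    # Direction the image moved in embedding space after editing.
    delta_img = [e - s for e, s in zip(img_edit, img_src)]
    # Direction from the source prompt to the target prompt.
    delta_txt = [t - s for t, s in zip(txt_tgt, txt_src)]
    return 1.0 - _cosine(delta_img, delta_txt)
```

The loss is 0 when the edit moves the image embedding in exactly the direction of the text change, 1 when the two directions are orthogonal, and 2 when they are opposite. In a real pipeline the embeddings would come from a CLIP model and the loss would be minimized by gradient descent, typically alongside a reconstruction term such as the one the summary mentions.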

Abstract

The use of denoising diffusion models is becoming increasingly popular in the field of image editing. However, current approaches often rely on either image-guided methods, which provide a visual reference but lack control over semantic consistency, or text-guided methods, which ensure alignment with the text guidance but compromise visual quality. To resolve this issue, we propose a framework that integrates a fusion of generated visual references and text guidance into the semantic latent space of a frozen pre-trained diffusion model. Using only a tiny neural network, our framework provides control over diverse content and attributes, driven intuitively by the text prompt. Compared to state-of-the-art methods, the framework generates images of higher quality while providing realistic editing effects across various benchmark datasets.

Paper Structure (11 sections, 5 equations, 6 figures, 2 tables, 1 algorithm)

Introduction
Methodology
Problem Definition
Framework
Step-Wise Alignment
Image Editing via SWA
Experiments
Editing Consistency
Editing Generalization
Ablation Experiments
Conclusion

Figures (6)

Figure 1: The framework of SWA. The reference image is encoded into features $\Delta h$. These features are then integrated into the latent features $h$ of the image being edited. The textual prompt contributes semantic information for the manipulation process.
Figure 2: Consistent editing. Once a reference image is provided, our method enables consistent and controllable editing that follows the reference image. In contrast, NTI fails to generate the corresponding style for the glasses.
Figure 3: Editing results for in-domain attributes.
Figure 4: Editing results for out-of-domain attributes.
Figure 5: Editing results of SWA on various datasets.
...and 1 more figure
