Textual and Visual Prompt Fusion for Image Editing via Step-Wise Alignment
Zhanbo Feng, Zenan Ling, Xinyu Lu, Ci Gong, Feng Zhou, Wugedele Bao, Jie Li, Fan Yang, Robert C. Qiu
TL;DR
This work addresses real-image editing with diffusion models by fusing visual references with text guidance in the semantic latent space of a frozen pre-trained diffusion model. The authors introduce Step-Wise Alignment (SWA), a framework with four components (Text Encoder, Visual Generator, Attribute Encoder, and Editing Generator) that fuses visual and textual prompts and refines their alignment through an asymmetrical diffusion update guided by a directional CLIP loss. A reconstruction term stabilizes the edits, enabling zero-shot attribute manipulation on both in-domain and out-of-domain targets while preserving image quality. On the CelebA-HQ and LSUN datasets, SWA produces high-quality, semantically consistent edits and outperforms prior methods in both realism and controllability, which makes it relevant to image editing tasks that require precise semantic control.
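To make the objective named above concrete, the following is a minimal sketch of a directional CLIP loss combined with a reconstruction term. It uses the commonly cited form of the directional loss, 1 - cos(ΔI, ΔT), where ΔI is the difference between the edited and source image embeddings and ΔT is the difference between the target and source text embeddings. This is not necessarily the exact formulation used in SWA: the function names, the L1 form of the reconstruction term, and the weights `w_dir` and `w_rec` are assumptions made for illustration, and plain Python lists stand in for real CLIP embeddings.

```python
import math


def _sub(a, b):
    """Element-wise difference a - b of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def _cosine(a, b, eps=1e-8):
    """Cosine similarity; the eps floor avoids division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def directional_clip_loss(img_src, img_edit, txt_src, txt_tgt):
    """1 - cos(delta_img, delta_txt) on precomputed embeddings.

    All four arguments are assumed to be embeddings from the same
    joint image-text space (for example, CLIP).
    """
    delta_img = _sub(img_edit, img_src)  # direction the image moved
    delta_txt = _sub(txt_tgt, txt_src)   # direction the text asks for
    return 1.0 - _cosine(delta_img, delta_txt)


def l1_reconstruction(x_src, x_edit):
    """Mean absolute difference; an assumed form of the reconstruction term."""
    return sum(abs(a - b) for a, b in zip(x_src, x_edit)) / len(x_src)


def total_loss(img_src, img_edit, txt_src, txt_tgt, x_src, x_edit,
               w_dir=1.0, w_rec=1.0):
    """Weighted sum of both terms; w_dir and w_rec are hypothetical weights."""
    return (w_dir * directional_clip_loss(img_src, img_edit, txt_src, txt_tgt)
            + w_rec * l1_reconstruction(x_src, x_edit))
```

The directional term is zero when the image embedding moves in exactly the direction given by the text pair and grows as the two directions diverge, while the reconstruction term penalizes changes unrelated to the requested edit.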
Abstract
Denoising diffusion models are increasingly popular for image editing. However, current approaches typically rely either on image-guided methods, which provide a visual reference but lack control over semantic consistency, or on text-guided methods, which ensure alignment with the text guidance but compromise visual quality. To resolve this trade-off, we propose a framework that fuses generated visual references with text guidance in the semantic latent space of a frozen pre-trained diffusion model. Using only a tiny neural network, our framework provides control over diverse content and attributes, driven intuitively by the text prompt. Compared to state-of-the-art methods, the framework generates higher-quality images while providing realistic editing effects across various benchmark datasets.
