DM-Align: Leveraging the Power of Natural Language Instructions to Make Changes to Images

Maria Mihaela Trusca; Tinne Tuytelaars; Marie-Francine Moens

TL;DR

DM-Align addresses text-guided image editing by making the edit region explicit: one-to-one word alignments between the source caption $c_1$ and the target caption $c_2$ determine which parts of the image to change and which to keep, and this guides a diffusion-based editing pipeline. The pipeline combines word-alignment-based region identification with Grounded-SAM segmentation, a global diffusion mask derived from two denoising passes conditioned on $c_1$ and $c_2$, a mask refinement step, and a final inpainting step that realizes the edit. The approach emphasizes background preservation and robustness to long and complex instructions. It outperforms several baselines on Dream, Bison, and Imagen on both image-based and text-based metrics, and the results are supported by ablation and human studies. Because the editing process is transparent and explainable, DM-Align is relevant to content creation pipelines that require consistent backgrounds and interpretable edit reasoning.
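The mask construction and refinement described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes the global mask is obtained by thresholding the normalized absolute difference between the noise estimates of the two denoising passes, and that refinement adds the regions of words to alter and then clears the regions of words to keep. The function names, the threshold of 0.5, the order of the refinement operations, and the tiny 2x3 "images" are all hypothetical.

```python
def diffusion_mask(noise_c1, noise_c2, threshold=0.5):
    """Binarize the normalized absolute difference of two noise estimates.

    Assumption: this differencing rule and the threshold are illustrative only.
    """
    diff = [[abs(a - b) for a, b in zip(r1, r2)]
            for r1, r2 in zip(noise_c1, noise_c2)]
    peak = max(max(row) for row in diff) or 1.0  # avoid division by zero
    return [[1 if v / peak > threshold else 0 for v in row] for row in diff]


def refine_mask(mask, edit_region, keep_region):
    """Add pixels of regions to alter, then clear pixels of regions to preserve."""
    return [[(m | e) & (1 - k) for m, e, k in zip(mr, er, kr)]
            for mr, er, kr in zip(mask, edit_region, keep_region)]


# Toy noise estimates for the passes conditioned on c_1 and c_2 (made-up values).
noise_c1 = [[0.0, 0.1, 0.9], [0.0, 0.2, 0.8]]
noise_c2 = [[0.0, 0.1, 0.1], [0.0, 0.9, 0.0]]

# Hypothetical segmentation masks: a region to alter and a region to keep.
edit_region = [[0, 0, 1], [0, 0, 1]]
keep_region = [[0, 0, 0], [0, 1, 0]]

coarse = diffusion_mask(noise_c1, noise_c2)
refined = refine_mask(coarse, edit_region, keep_region)
print(coarse)
print(refined)
```

In this toy case the coarse mask spills onto a pixel that belongs to a kept object, and the refinement step removes it, which is the kind of background preservation the summary emphasizes. The refined mask would then be handed to an inpainting model.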

Abstract

Text-based semantic image editing is the manipulation of an image using a natural language instruction. Although recent works are capable of generating creative, high-quality images, the problem is still mostly approached as a black box that is prone to generating unexpected outputs. Therefore, we propose a novel model that enhances the text-based control of an image editor by explicitly reasoning about which parts of the image to alter or preserve. It relies on word alignments between a description of the original source image and the instruction that reflects the needed updates, together with the input image. The proposed Diffusion Masking with word Alignments (DM-Align) allows the editing of an image in a transparent and explainable way. It is evaluated on a subset of the Bison dataset and a self-defined dataset dubbed Dream. Compared to state-of-the-art baselines, quantitative and qualitative results show that DM-Align has superior performance in image editing conditioned on language instructions, preserves the background of the image well, and copes better with long text instructions.
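To make the idea of aligning the source description with the edit instruction concrete, here is a minimal sketch that uses Python's standard `difflib.SequenceMatcher` as a stand-in aligner. The aligner used by DM-Align is not specified in this summary, so the function name `align_captions`, the four labels, and the whitespace tokenization are assumptions for illustration only. The example captions are the ones quoted in the Figure 2 caption.

```python
from difflib import SequenceMatcher


def align_captions(c1: str, c2: str) -> dict:
    """Label words as kept, substituted, deleted, or inserted.

    Assumption: a sequence diff over lowercase whitespace tokens stands in
    for a real word aligner.
    """
    src, tgt = c1.lower().split(), c2.lower().split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    result = {"keep": [], "substitute": [], "delete": [], "insert": []}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result["keep"].extend(src[i1:i2])
        elif tag == "replace":
            # Pair words one-to-one; leftovers become deletions or insertions.
            n = min(i2 - i1, j2 - j1)
            result["substitute"].extend(zip(src[i1:i1 + n], tgt[j1:j1 + n]))
            result["delete"].extend(src[i1 + n:i2])
            result["insert"].extend(tgt[j1 + n:j2])
        elif tag == "delete":
            result["delete"].extend(src[i1:i2])
        elif tag == "insert":
            result["insert"].extend(tgt[j1:j2])
    return result


alignment = align_captions(
    "A clear sky and a ship landed on the sand",
    "A clear sky and a ship landed on the ocean",
)
print(alignment["substitute"])
print(alignment["keep"])
```

Under this simplification, substituted words mark content to alter, while kept words such as "sky" and "ship" mark content to preserve. A sequence diff cannot handle paraphrases or reordered words, which is one reason a dedicated word aligner would be preferable in practice.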

Paper Structure (14 sections, 5 equations, 12 figures, 7 tables)

Introduction
Related work
Proposed model
Task Definition
Word alignment between the text instructions
Segmentation of the image based on the word alignments
Diffusion mask
Refinement of the diffusion mask
Experimental setup
Results and discussion
Quantitative analysis and ablation tests
Ablation tests
Human qualitative analysis
Conclusion, limitations and future work

Figures (12)

Figure 1: The proposed image editor utilizes a source caption to describe the initial image and a target text instruction to define the desired edited image. To accomplish this task, we employ the two captions to generate a diffusion mask, refining it further by incorporating regions of words that we want to keep or alter in the image.
Figure 2: The implementation of DM-Align. The aim is to update the input image described by the text instruction $c_1$ ("A clear sky and a ship landed on the sand") according to the text instruction $c_2$ ("A clear sky and a ship landed on the ocean").
Figure 3: Semantic image editing: Imagen dataset. Source captions: (1) $c_1$. A photo of a British shorthair cat wearing a cowboy hat and red shirt riding a bike on a beach. (2) $c_1$. An oil painting of a raccoon wearing sunglasses and red shirt playing a guitar on top of a mountain. (3) $c_1$. An oil painting of a fuzzy panda wearing sunglasses and red shirt riding a bike on a beach.
Figure 4: Word alignment example. Blue: identical words, Purple: substituted words, Green: nouns with different modifiers, Red: nouns mentioned only in the source caption $c_1$.
Figure 5: Semantic image editing: Bison dataset. Source captions: (1) $c_1$. A man standing next to a baby elephant in the city. (2) $c_1$. A wooden plate topped with sliced meat and vegetables. (3) $c_1$. A vase filled with red and white flowers.
...and 7 more figures
