DM-Align: Leveraging the Power of Natural Language Instructions to Make Changes to Images

Maria Mihaela Trusca, Tinne Tuytelaars, Marie-Francine Moens

TL;DR

DM-Align addresses text-guided image editing by making explicit where to edit: one-to-one word alignments between the source caption $c_1$ and the target caption $c_2$ guide a diffusion-based editing pipeline. It combines word-alignment-based region identification with Grounded-SAM segmentation, a global diffusion mask derived from two denoising passes conditioned on $c_1$ and $c_2$, and a refinement step followed by inpainting to realize the edits. The approach emphasizes background preservation and robustness to long and complex instructions. It outperforms several baselines on the Dream, Bison, and Imagen datasets under both image-based and text-based metrics, and these results are supported by ablation and human studies. Because the editing process is transparent and explainable, DM-Align advances controllable image editing, with potential impact on content creation pipelines that require consistent backgrounds and interpretable edit reasoning.
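The mask steps summarized above can be illustrated with a small sketch. It is not the authors' implementation: the normalization, the threshold value, the rule for combining masks, and the names `diffusion_mask` and `refine_mask` are all assumptions made for illustration. Plain nested lists stand in for the noise estimates of the two denoising passes and for the segmentation regions.

```python
from typing import List

Grid = List[List[float]]
Mask = List[List[bool]]


def diffusion_mask(noise_c1: Grid, noise_c2: Grid, threshold: float = 0.5) -> Mask:
    """Threshold the normalized absolute difference of two noise estimates.

    Assumption: pixels where the estimates conditioned on c1 and c2 disagree
    strongly are the ones the edit should touch.
    """
    diff = [[abs(a - b) for a, b in zip(row1, row2)]
            for row1, row2 in zip(noise_c1, noise_c2)]
    peak = max(max(row) for row in diff)
    if peak == 0.0:
        # Identical estimates: nothing to edit.
        return [[False for _ in row] for row in diff]
    return [[value / peak >= threshold for value in row] for row in diff]


def refine_mask(mask: Mask, alter: Mask, keep: Mask) -> Mask:
    """Add regions that must change, then remove regions that must be preserved.

    Assumption: `alter` and `keep` are binary region masks, e.g. produced by a
    segmentation model for words selected through the word alignment.
    """
    return [[(m or a) and not k for m, a, k in zip(mask_row, alter_row, keep_row)]
            for mask_row, alter_row, keep_row in zip(mask, alter, keep)]


# Toy 2x2 example with made-up values.
noise_c1 = [[0.0, 0.1], [0.2, 0.9]]
noise_c2 = [[0.0, 0.1], [0.8, 0.1]]
coarse = diffusion_mask(noise_c1, noise_c2)

alter = [[False, True], [False, False]]
keep = [[False, False], [True, False]]
refined = refine_mask(coarse, alter, keep)
```

In this toy case the coarse mask covers the bottom row, and refinement adds the top-right pixel while releasing the bottom-left one. Removing `keep` regions last reflects the emphasis on background preservation, but that ordering is a design choice of this sketch.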

Abstract

Text-based semantic image editing involves manipulating an image using a natural language instruction. Although recent works are capable of generating creative, high-quality images, the problem is still mostly approached as a black box that is prone to generating unexpected outputs. Therefore, we propose a novel model to enhance the text-based control of an image editor by explicitly reasoning about which parts of the image to alter or preserve. It relies on word alignments between a description of the original source image and the instruction that reflects the needed updates, together with the input image. The proposed Diffusion Masking with word Alignments (DM-Align) allows the editing of an image in a transparent and explainable way. It is evaluated on a subset of the Bison dataset and a self-defined dataset dubbed Dream. Compared with state-of-the-art baselines, quantitative and qualitative results show that DM-Align achieves superior performance in image editing conditioned on language instructions, preserves the background of the image well, and copes better with long text instructions.

Paper Structure

This paper contains 14 sections, 5 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: The proposed image editor utilizes a source caption to describe the initial image and a target text instruction to define the desired edited image. To accomplish this task, we employ the two captions to generate a diffusion mask, refining it further by incorporating regions of words that we want to keep or alter in the image.
  • Figure 2: The implementation of DM-Align. The aim is to update the input image described by the text instruction $c_1$ ("A clear sky and a ship landed on the sand") according to the text instruction $c_2$ ("A clear sky and a ship landed on the ocean").
  • Figure 3: Semantic image editing: Imagen dataset. Source captions: (1) $c_1$. A photo of a British shorthair cat wearing a cowboy hat and red shirt riding a bike on a beach. (2) $c_1$. An oil painting of a raccoon wearing sunglasses and red shirt playing a guitar on top of a mountain. (3) $c_1$. An oil painting of a fuzzy panda wearing sunglasses and red shirt riding a bike on a beach.
  • Figure 4: Word alignment example. Blue: identical words, Purple: substituted words, Green: nouns with different modifiers, Red: nouns mentioned only in the source caption $c_1$.
  • Figure 5: Semantic image editing: Bison dataset. Source captions: (1) $c_1$. A man standing next to a baby elephant in the city. (2) $c_1$. A wooden plate topped with sliced meat and vegetables. (3) $c_1$. A vase filled with red and white flowers.
  • ...and 7 more figures
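
The word alignment categories named in the Figure 4 caption can be illustrated with a toy sketch. It uses Python's standard `difflib.SequenceMatcher` as a stand-in for the aligner, which is an assumption and not the alignment model used in the paper. The function name `align_captions` and the category keys are hypothetical. The "nouns with different modifiers" category is omitted because it would need part-of-speech information.

```python
from difflib import SequenceMatcher
from typing import Dict


def align_captions(c1: str, c2: str) -> Dict[str, list]:
    """Group the words of two captions into rough alignment categories."""
    src = c1.lower().split()
    tgt = c2.lower().split()
    result: Dict[str, list] = {
        "identical": [],    # words shared by both captions
        "substituted": [],  # (source word, target word) pairs
        "source_only": [],  # words mentioned only in c1
        "target_only": [],  # words mentioned only in c2
    }
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result["identical"].extend(src[i1:i2])
        elif tag == "replace":
            # Pair words one-to-one; leftovers fall into the *_only buckets.
            paired = min(i2 - i1, j2 - j1)
            result["substituted"].extend(
                zip(src[i1:i1 + paired], tgt[j1:j1 + paired]))
            result["source_only"].extend(src[i1 + paired:i2])
            result["target_only"].extend(tgt[j1 + paired:j2])
        elif tag == "delete":
            result["source_only"].extend(src[i1:i2])
        else:  # tag == "insert"
            result["target_only"].extend(tgt[j1:j2])
    return result


# Captions taken from the Figure 2 description.
alignment = align_captions(
    "A clear sky and a ship landed on the sand",
    "A clear sky and a ship landed on the ocean",
)
```

For the Figure 2 captions, the only substituted pair is `("sand", "ocean")`, so in a pipeline like the one described the region of "sand" would be a candidate for editing, while the regions of the shared words would be candidates for preservation.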