CAMILA: Context-Aware Masking for Image Editing with Language Alignment

Hyunseung Kim, Chiho Choi, Srikanth Malla, Sai Prahladh Padmanabhan, Saurabh Bagchi, Joon Hee Choi

TL;DR

CAMILA addresses the challenge of text-guided image editing when prompts may be infeasible by introducing a context-aware masking mechanism that validates instruction executability against the image. It leverages a Multimodal Large Language Model to produce [MASK] and [NEG] tokens, which are aligned with text via a Token Broadcaster and decoded into precise region masks that guide a diffusion model. The approach achieves superior pixel- and semantic-level alignment across single-, multi-, and context-aware editing settings, supported by newly created context-aware datasets and a surrogate module to further refine masks, as evidenced by improvements in L1/L2, CLIP-I, DINO, CLIP-T, and PickScore metrics. This work advances safe, controllable image editing and highlights directions for tighter diffusion-mask integration and enhanced region-level control.

Abstract

Text-guided image editing allows users to transform and synthesize images through natural language instructions, offering considerable flexibility. However, most existing image editing models naively attempt to follow all user instructions, even if those instructions are inherently infeasible or contradictory, often resulting in nonsensical output. To address these challenges, we propose a context-aware method for image editing named CAMILA (Context-Aware Masking for Image Editing with Language Alignment). CAMILA is designed to validate the contextual coherence between instructions and the image, ensuring that only relevant edits are applied to the designated regions while non-executable instructions are ignored. For comprehensive evaluation of this new method, we constructed datasets for both single- and multi-instruction image editing that include infeasible requests. Our method achieves better performance and higher semantic alignment than state-of-the-art models, demonstrating its effectiveness in handling complex instruction challenges while preserving image integrity.

Paper Structure

This paper contains 32 sections, 8 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Three scenarios demonstrate how our method handles context-aware multi-instruction editing across various combinations of feasible and infeasible prompts. By leveraging [MASK] and [NEG] specialized tokens, it accurately identifies executable instructions.
  • Figure 2: The architecture of CAMILA begins by jointly processing the image $x_{\text{img}}$ and text instructions $x_{\text{txt}}$ using an MLLM. Output tokens are classified as either [MASK] or [NEG], indicating regions to modify or leave unchanged. These tokens are aligned with the text embeddings using the Token Broadcaster, and the final binary mask is generated by the Token Decoder. The mask is then applied in a diffusion model to produce the edited image.
  • Figure 3: Architecture of the Token Broadcaster. It calculates similarity between MLLM output tokens and encoded text features, assigning each output token to the text embedding that best matches its corresponding semantic region.
  • Figure 4: Qualitative comparisons: FoI needs to extract keywords from each instruction using a pretrained GPT model before running the model. Furthermore, due to inaccuracies in the attention maps of the diffusion model, FoI often fails to make precise modifications. In the case of context-aware instructions, CAMILA accurately identifies applicable instructions by generating [MASK] and [NEG] tokens from the MLLM. We present the decoded mask for the [MASK] token of each instruction.
  • Figure 4: Quantitative comparison on single instruction tasks. CAMILA excels in single instruction tasks by generating precise masks that accurately target modification areas.
  • ...and 5 more figures
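
The matching step described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: the use of cosine similarity, the hard argmax assignment, and every name below (`cosine`, `broadcast_tokens`, the toy embeddings) are assumptions made for illustration only. The caption states only that similarity is computed between MLLM output tokens and encoded text features and that each token is assigned to its best-matching text embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def broadcast_tokens(token_embs, token_kinds, text_embs):
    """Hypothetical stand-in for the Token Broadcaster.

    token_embs:  one embedding per MLLM output token
    token_kinds: "MASK" or "NEG" for each output token
    text_embs:   one embedding per encoded instruction

    Returns the index of the best-matching instruction for each token,
    and a dict mapping instruction index -> "MASK" / "NEG".
    """
    assignments = []
    for tok in token_embs:
        sims = [cosine(tok, txt) for txt in text_embs]
        # Assumed rule: pick the instruction with the highest similarity.
        assignments.append(max(range(len(sims)), key=sims.__getitem__))
    status = {idx: kind for idx, kind in zip(assignments, token_kinds)}
    return assignments, status


# Toy example: two instructions, two output tokens.
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
token_embs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
token_kinds = ["MASK", "NEG"]

assignments, status = broadcast_tokens(token_embs, token_kinds, text_embs)
executable = [i for i, kind in status.items() if kind == "MASK"]
```

In this toy setup, the first instruction receives a [MASK] token and would go on to mask decoding, while the second receives a [NEG] token and would be skipped. A learned module could replace the fixed similarity rule without changing this interface.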