Preserve or Modify? Context-Aware Evaluation for Balancing Preservation and Modification in Text-Guided Image Editing

Yoonjeon Kim, Soohyun Ryu, Yeonsung Jung, Hyunkoo Lee, Joowon Kim, June Yong Yang, Jaeryong Hwang, Eunho Yang

TL;DR

AugCLIP tackles the context-blindness problem in evaluating text-guided image edits by introducing a context-aware metric that balances preservation and modification. It uses a multi-modal language model to extract source and target attributes, embeds them in CLIP space, and learns a separating hyperplane to define an ideal, minimally modified edit via a vector v. The score compares the edited image to this ideal representation, preserving core source content while aligning with the target text, and is shown to correlate strongly with human judgments across diverse datasets, including personalized generation scenarios. This approach provides a robust, scalable tool for evaluating text-guided edits with improved reliability over existing metrics, enabling more consistent comparisons of editing methods and guiding model development.

Abstract

The development of vision-language and generative models has significantly advanced text-guided image editing, which seeks to preserve the core elements of the source image while implementing modifications based on the target text. However, existing metrics have a context-blindness problem: they indiscriminately apply the same evaluation criteria to completely different pairs of source image and target text, which biases them towards either modification or preservation. Directional CLIP similarity, the only metric that considers both source image and target text, is also biased towards modification and attends to irrelevant editing regions of the image. We propose AugCLIP, a context-aware metric that adaptively coordinates preservation and modification depending on the specific context of a given source image and target text. This is done by deriving the CLIP representation of an ideally edited image that preserves the source image while making the modifications necessary to align with the target text. More specifically, using a multi-modal large language model, AugCLIP augments the textual descriptions of the source and target, then calculates a modification vector through a hyperplane that separates source and target attributes in CLIP space. Extensive experiments on five benchmark datasets, encompassing a diverse range of editing scenarios, show that AugCLIP aligns remarkably well with human evaluation standards, outperforming existing metrics. The code is available at https://github.com/augclip/augclip_eval.
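
The pipeline described in the abstract can be sketched with toy vectors. The following is a minimal illustration, not the authors' implementation: the 3-D arrays stand in for CLIP embeddings, the hyperplane is fitted with a simple mean-difference rule (the text here does not say how the paper fits it), and the function names, the `margin` parameter, and the cosine-similarity score are all assumptions made for illustration.

```python
import numpy as np


def fit_hyperplane(src_attrs, tgt_attrs):
    """Mean-difference hyperplane w.x + b = 0 (assumed fitting rule).

    Points with w.x + b > 0 fall on the target side.
    """
    mu_s = src_attrs.mean(axis=0)
    mu_t = tgt_attrs.mean(axis=0)
    w = mu_t - mu_s
    b = -float(w @ (mu_s + mu_t)) / 2.0  # plane passes through the midpoint
    return w, b


def minimal_modification(x, w, b, margin=0.0):
    """Smallest L2-norm v such that w.(x + v) + b >= margin."""
    gap = margin - (float(w @ x) + b)
    if gap <= 0:  # x is already far enough on the target side
        return np.zeros_like(x)
    return (gap / float(w @ w)) * w


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def augclip_like_score(src_img, edit_img, src_attrs, tgt_attrs, margin=0.0):
    """Similarity between the edited image and the ideal point src_img + v."""
    w, b = fit_hyperplane(src_attrs, tgt_attrs)
    v = minimal_modification(src_img, w, b, margin)
    return cosine(edit_img, src_img + v)


# Toy "embeddings": axis 0 ~ source attribute, axis 1 ~ target attribute,
# axis 2 ~ identity content that the edit should leave untouched.
src_attrs = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
tgt_attrs = np.array([[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]])
src_img = np.array([1.0, 0.0, 1.0])

edit_good = np.array([0.2, 0.8, 1.0])  # attribute changed, identity kept
edit_over = np.array([0.0, 1.0, 0.0])  # attribute changed, identity lost
edit_none = src_img.copy()             # nothing changed

margin = 0.5
w, b = fit_hyperplane(src_attrs, tgt_attrs)
v = minimal_modification(src_img, w, b, margin)

score_good = augclip_like_score(src_img, edit_good, src_attrs, tgt_attrs, margin)
score_over = augclip_like_score(src_img, edit_over, src_attrs, tgt_attrs, margin)
score_none = augclip_like_score(src_img, edit_none, src_attrs, tgt_attrs, margin)
print(score_good, score_over, score_none)
```

In this toy setup the edit that changes the attribute while keeping the identity axis scores highest, and both the over-edited and the unedited image score lower. With real data the arrays would come from a CLIP image and text encoder, and the attribute lists from the multi-modal language model.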

Paper Structure

This paper contains 68 sections, 17 equations, 22 figures, 14 tables.

Figures (22)

  • Figure 1: Context Blindness Problem of Existing Evaluation Metrics. Evaluation metrics should consider the specific context of a given source image and target text. However, existing metrics exhibit context-blindness, applying the same criteria of either 'preserve' (P) or 'modify' (M) across the entire image. Our proposed metric, AugCLIP, is a context-aware metric that flexibly applies different criteria to local regions of the image.
  • Figure 2: Combination of Preservation- and Modification-Centric Metrics Deteriorates in Performance. The plot shows the human alignment score ${\bm{s}}_\text{align}$ of a linear interpolation between each of {DINO, SC, CLIP-I} and CLIP-T. The results show that combining the two kinds of metrics degrades, rather than improves, alignment with human judgment.
  • Figure 3: Problems of Directional CLIP Similarity. (a) $\mathrm{CLIP}_\text{dir}$ assigns higher scores to excessively modified images than to well-edited ground-truth images. (b) $\mathrm{CLIP}_\text{dir}$ evaluates edited images by attending to irrelevant regions of the image. Adding visual annotations helps $\mathrm{CLIP}_\text{dir}$ properly attend to edited regions.
  • Figure 4: (a) Description Extraction Process. The source image depicts a young child standing on a balance board. The target text guides the editing model to make the girl sit. The source and target attributes are extracted with an MLLM. (b) Evaluation Process of AugCLIP. The two edited images show (i) an older woman sitting down with legs crossed ($I_\text{edit}^1$) and (ii) a young girl sitting on the floor ($I_\text{edit}^2$). AugCLIP derives the ideal image representation by applying to the source image the minimum modification $\mathbf v$ needed for it to be classified as the target. The second image, which is closer to $I_\text{src} + \mathbf v$, receives a higher score, while the first image, which is excessively modified and has lost the source identity, receives a lower score.
  • Figure 5: Difference between $\mathrm{CLIP}_\text{dir}$ and AugCLIP. The red line indicates the evaluation standard and the black line indicates the change in image from source to target. Both $\mathrm{CLIP}_\text{dir}$ and AugCLIP measure the quality of the edited image according to the corresponding red lines. In (b), the green and yellow circles indicate the distribution of target and source attributes, respectively, and dotted black lines indicate the evaluation standard of $\mathrm{CLIP}_\text{dir}$.
  • ...and 17 more figures
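
Figures 3 and 5 refer to directional CLIP similarity. As commonly defined, it is the cosine between the change in image embedding and the change in text embedding. The sketch below uses toy vectors in place of CLIP embeddings; the function name `clip_dir` and the use of a source-caption embedding `src_txt` are assumptions, since the formula is not spelled out in the text above.

```python
import numpy as np


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def clip_dir(src_img, edit_img, src_txt, tgt_txt):
    """Cosine between the image-space edit and the text-space edit."""
    return cosine(edit_img - src_img, tgt_txt - src_txt)


# Toy stand-ins for CLIP embeddings (hypothetical values).
src_img = np.array([1.0, 0.0, 1.0])
src_txt = np.array([1.0, 0.0, 0.0])
tgt_txt = np.array([0.0, 1.0, 0.0])

step = tgt_txt - src_txt
subtle_edit = src_img + 0.5 * step   # small move along the text direction
extreme_edit = src_img + 5.0 * step  # ten times larger move, same direction

score_subtle = clip_dir(src_img, subtle_edit, src_txt, tgt_txt)
score_extreme = clip_dir(src_img, extreme_edit, src_txt, tgt_txt)
print(score_subtle, score_extreme)
```

Because a cosine ignores vector length, the small and the much larger step along the same direction receive the same score here. This is consistent with, though not a full account of, the over-modification behaviour described in Figure 3(a).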