EditCLIP: Representation Learning for Image Editing

Qian Wang, Aleksandar Cvejic, Abdelrahman Eldesokey, Peter Wonka

TL;DR

EditCLIP introduces a unified edit representation by embedding the transformation from an input image to its edited version into the CLIP space. Trained on pairs of input and edited images, it produces an edit embedding that aligns with editing instructions. This embedding can replace textual prompts in diffusion models for exemplar-based editing, and it supports automated evaluation through the EC2T and EC2EC metrics. EditCLIP is pre-trained with a CLIP-like objective on concatenated image pairs while the text encoder is kept frozen, and it supports transferable edits with less computation than pipelines built on large Vision-Language Models. Experiments on IP2P-derived data and TOP-Bench-X show competitive, scalable performance and strong alignment with human judgments, for both editing and evaluation in image synthesis pipelines.

Abstract

We introduce EditCLIP, a novel representation-learning approach for image editing. Our method learns a unified representation of edits by jointly encoding an input image and its edited counterpart, effectively capturing their transformation. To evaluate its effectiveness, we employ EditCLIP to solve two tasks: exemplar-based image editing and automated edit evaluation. In exemplar-based image editing, we replace text-based instructions in InstructPix2Pix with EditCLIP embeddings computed from a reference exemplar image pair. Experiments demonstrate that our approach outperforms state-of-the-art methods while being more efficient and versatile. For automated evaluation, EditCLIP assesses image edits by measuring the similarity between the EditCLIP embedding of a given image pair and either a textual editing instruction or the EditCLIP embedding of another reference image pair. Experiments show that EditCLIP aligns more closely with human judgments than existing CLIP-based metrics, providing a reliable measure of edit quality and structural preservation.
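The two evaluation modes described in the abstract can be sketched as similarity scores between embeddings. This is a minimal illustration, not the authors' implementation: it assumes the similarity is cosine similarity, that EC2T compares an edit embedding with a text-instruction embedding, and that EC2EC compares it with the edit embedding of a reference pair. The function names `cosine_similarity`, `ec2t_score`, and `ec2ec_score` are hypothetical, and the embeddings are random placeholder vectors rather than outputs of a real encoder.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / (norm_a * norm_b))


def ec2t_score(edit_embedding, text_embedding):
    """Assumed EC2T: edit embedding of an image pair vs. an instruction embedding."""
    return cosine_similarity(edit_embedding, text_embedding)


def ec2ec_score(edit_embedding, reference_edit_embedding):
    """Assumed EC2EC: edit embedding of an image pair vs. that of a reference pair."""
    return cosine_similarity(edit_embedding, reference_edit_embedding)


if __name__ == "__main__":
    # Placeholder vectors standing in for encoder outputs (the dimension is arbitrary).
    rng = np.random.default_rng(0)
    edit_emb = rng.normal(size=8)
    text_emb = rng.normal(size=8)
    reference_emb = rng.normal(size=8)
    print("EC2T :", ec2t_score(edit_emb, text_emb))
    print("EC2EC:", ec2ec_score(edit_emb, reference_emb))
```

Under these assumptions both scores lie in [-1, 1], and a higher value means the evaluated edit is closer to the instruction or to the reference edit.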

Paper Structure

This paper contains 28 sections, 14 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: EditCLIP provides a unified representation of image edits by encoding the transformation between an image and its edited counterpart within the CLIP space. We demonstrate the effectiveness of EditCLIP embeddings in exemplar-based image editing and automated evaluation of image editing pipelines, where it achieves better alignment with human assessment.
  • Figure 2: An overview of our proposed approach. EditCLIP is pre-trained similarly to CLIP, but the visual encoder processes a concatenated exemplar image pair. After pre-training, EditCLIP can replace the text encoder in InstructPix2Pix (Brooks et al., 2023) to enable exemplar-based editing.
  • Figure 3: A visualization of the visual encoder’s attention in EditCLIP compared to the original CLIP. We visualize the attention of the $[CLS]$ token from the last attention head. Unlike CLIP, where attention is dispersed across the image, EditCLIP focuses on the differences between the input and edited image, indicating that it effectively captures the edited regions.
  • Figure 4: Qualitative comparison for exemplar-based image editing.
  • Figure 5: EditCLIP can perform complex edits when the exemplars contain multiple edits in a single step.
  • ...and 10 more figures
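
The pre-training setup summarized in the Figure 2 caption (a CLIP-like objective over a concatenated image pair) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the caption does not say along which axis the pair is concatenated, so channel-wise concatenation is assumed here; the loss is the standard symmetric contrastive loss used by CLIP; the temperature of 0.07 is an assumed value; and `concat_pair` and `clip_style_loss` are hypothetical names. The embeddings are random placeholders instead of real encoder outputs.

```python
import numpy as np


def concat_pair(input_image, edited_image):
    """Stack an (H, W, C) input image and its edited version along the channel axis.

    Channel-wise concatenation is an assumption made for this sketch.
    """
    if input_image.shape != edited_image.shape:
        raise ValueError("input and edited images must have the same shape")
    return np.concatenate([input_image, edited_image], axis=-1)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_style_loss(edit_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of (edit, instruction) embeddings.

    Row i of edit_emb and row i of text_emb are treated as a matching pair.
    """
    e = edit_emb / np.linalg.norm(edit_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = e @ t.T / temperature  # (batch, batch) scaled cosine similarities
    idx = np.arange(logits.shape[0])
    loss_edit_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_edit = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float((loss_edit_to_text + loss_text_to_edit) / 2.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pair = concat_pair(rng.random((224, 224, 3)), rng.random((224, 224, 3)))
    print("encoder input shape:", pair.shape)

    # Placeholder embeddings; in training these would come from the visual
    # encoder (applied to the pair) and the frozen text encoder.
    edit_emb = rng.normal(size=(4, 16))
    text_emb = rng.normal(size=(4, 16))
    print("contrastive loss:", clip_style_loss(edit_emb, text_emb))
```

With a frozen text encoder, only the visual encoder that produces `edit_emb` would receive gradients from this loss, which pulls each edit embedding toward the embedding of its own instruction.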