EditCLIP: Representation Learning for Image Editing
Qian Wang, Aleksandar Cvejic, Abdelrahman Eldesokey, Peter Wonka
TL;DR
EditCLIP learns a unified edit representation by embedding the transformation from an input image to its edited version in CLIP space. Trained on pairs of input and edited images, it produces an edit embedding aligned with the corresponding editing instruction. This lets an exemplar image pair stand in for the text prompt in diffusion-based editing, and it supports automated evaluation through the EC2T and EC2EC metrics. The model is pre-trained with a CLIP-like objective on concatenated image pairs while the text encoder is kept frozen, and it supports transferable edits at lower computational cost than pipelines built on large Vision-Language Models. Experiments on IP2P-derived data and TOP-Bench-X show competitive, scalable performance and strong alignment with human judgments, which makes it practical for both editing and evaluation in image synthesis pipelines.
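To make the "CLIP-like objective" concrete, the following is a minimal sketch of a symmetric contrastive loss over a batch of matched edit and instruction embeddings. It is an illustration under assumptions, not the authors' training code: the function names, the toy list-of-floats embeddings, and the temperature of 0.07 are hypothetical, and the image-pair encoder is left out entirely. Text embeddings are passed in as fixed inputs, which loosely mirrors a frozen text encoder.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def clip_style_loss(edit_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matched pair.

    edit_embs: embeddings of (input, edited) image pairs (hypothetical inputs).
    text_embs: embeddings of the editing instructions, treated as fixed.
    temperature: assumed value, not taken from the paper.
    """
    edits = [normalize(v) for v in edit_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(edits)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(edits[i], texts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Edit-to-text direction: each row should peak on its diagonal entry.
    loss_e2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-edit direction: same computation over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2e = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_e2t + loss_t2e)
```

In this sketch the loss is small when each edit embedding is closest to its own instruction embedding and large when the pairing is shuffled, which is the behavior a CLIP-style objective rewards.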
Abstract
We introduce EditCLIP, a novel representation-learning approach for image editing. Our method learns a unified representation of edits by jointly encoding an input image and its edited counterpart, effectively capturing their transformation. To evaluate its effectiveness, we employ EditCLIP to solve two tasks: exemplar-based image editing and automated edit evaluation. In exemplar-based image editing, we replace text-based instructions in InstructPix2Pix with EditCLIP embeddings computed from a reference exemplar image pair. Experiments demonstrate that our approach outperforms state-of-the-art methods while being more efficient and versatile. For automated evaluation, EditCLIP assesses image edits by measuring the similarity between the EditCLIP embedding of a given image pair and either a textual editing instruction or the EditCLIP embedding of another reference image pair. Experiments show that EditCLIP aligns more closely with human judgments than existing CLIP-based metrics, providing a reliable measure of edit quality and structural preservation.
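The two evaluation modes described in the abstract can be sketched as similarity scores between embeddings. The example below assumes cosine similarity as the similarity measure and uses plain lists of floats as stand-ins for real embeddings; the function names `edit_to_text_score` and `edit_to_edit_score` are hypothetical and do not come from the paper or any released code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def edit_to_text_score(edit_emb, instruction_emb):
    """Score an image pair's edit embedding against an instruction embedding.

    Assumes both embeddings live in the same space (hypothetical inputs).
    """
    return cosine_similarity(edit_emb, instruction_emb)


def edit_to_edit_score(edit_emb, reference_edit_emb):
    """Score an image pair's edit embedding against a reference pair's embedding."""
    return cosine_similarity(edit_emb, reference_edit_emb)
```

A higher score in either function would indicate that the evaluated edit is closer to the instruction or to the reference edit; how such scores are aggregated or calibrated against human judgments is not specified here.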
