Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning

Sherry X. Chen, Misha Sra, Pradeep Sen

TL;DR

Instruct-CLIP (I-CLIP) addresses the data bottleneck in instruction-guided image editing by learning the semantic changes between original and edited images and refining edit instructions so that they better reflect the actual edits. It employs a dual-branch contrastive framework that maps visual changes and instructions into a shared space, with a DINOv2 front-end to robustly extract visual features and a DeCap-style decoder to recover refined instructions. The approach extends to latent-diffusion training through LD-DINOv2, which handles noisy latent representations and diffusion timesteps, and it is used to produce a refined InstructPix2Pix (IP2P) dataset of over 120K samples for fine-tuning. Empirical results show improved alignment with instructions and higher user preference than prior methods, though limitations remain due to the inherent constraints of the underlying generative models. Overall, I-CLIP provides a scalable, self-supervised route to better instruction-guided image editing by aligning semantic changes with textual guidance and refining training data accordingly.

Abstract

Although natural language instructions offer an intuitive way to guide automated image editing, deep-learning models often struggle to achieve high-quality results, largely due to the difficulty of creating large, high-quality training datasets. To build such datasets, previous approaches have typically relied on text-to-image (T2I) generative models to produce pairs of original and edited images that simulate the input/output of an instruction-guided image-editing model. However, these image pairs often fail to align with the specified edit instructions due to the limitations of T2I models, which negatively impacts models trained on such datasets. To address this, we present Instruct-CLIP (I-CLIP), a self-supervised method that learns the semantic changes between original and edited images to refine and better align the instructions in existing datasets. Furthermore, we adapt Instruct-CLIP to handle noisy latent images and diffusion timesteps so that it can be used to train latent diffusion models (LDMs) and efficiently enforce alignment between the edit instruction and the image changes in latent space at any step of the diffusion pipeline. We use Instruct-CLIP to correct the InstructPix2Pix dataset and obtain over 120K refined samples, which we then use to fine-tune the InstructPix2Pix model, guided by our novel I-CLIP-based loss function. The resulting model can produce edits that are more aligned with the given instructions. Our code and dataset are available at https://github.com/SherryXTChen/Instruct-CLIP.git.
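To make "noisy latent images and diffusion timesteps" concrete, the following is a minimal sketch of the standard forward-diffusion step that turns a clean latent into a noisy one at a given timestep. It is not taken from the paper's code: the function names `alpha_bar_schedule` and `add_noise`, the linear beta schedule (the DDPM defaults of 1e-4 to 0.02 over 1000 steps), and the use of a flat Python list as a stand-in for a VAE latent are all assumptions made for illustration. The actual schedule used by the authors' pipeline may differ.

```python
import math
import random


def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule.

    The schedule parameters are the DDPM defaults, assumed here for illustration.
    """
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(latent, t, alpha_bars, rng):
    """Forward diffusion: sqrt(abar_t) * latent + sqrt(1 - abar_t) * eps."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [signal * x + sigma * rng.gauss(0.0, 1.0) for x in latent]


# Hypothetical usage: a 4-element "latent" noised at an early and a late timestep.
rng = random.Random(0)
alpha_bars = alpha_bar_schedule()
clean_latent = [0.5, -1.0, 0.25, 2.0]
slightly_noisy = add_noise(clean_latent, 10, alpha_bars, rng)
mostly_noise = add_noise(clean_latent, 999, alpha_bars, rng)
```

At small timesteps the signal coefficient stays close to 1, so the latent is nearly intact; at the final timestep it is close to 0, so the result is almost pure Gaussian noise. A model that must operate "at any step of the diffusion pipeline" therefore has to cope with this whole range, which is why the timestep is supplied as an extra input.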

Paper Structure

This paper contains 22 sections, 16 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Results showcasing the strength of Instruct-CLIP (I-CLIP) compared to state-of-the-art InstructPix2Pix (IP2P) \cite{brooks2023instructpix2pix}.
  • Figure 2: Problems with existing instruction-guided image-editing datasets \cite{brooks2023instructpix2pix}. As shown, there are many examples where the dataset's original edit instruction does not match the actual changes in the images. Our I-CLIP approach refines edit instructions to better match the visual change and allows us to train a system that produces better outputs. The values in parentheses are the cosine similarities, computed with I-CLIP, between the visual change from the original to the edited image and the edit instruction.
  • Figure 3: Instruct-CLIP architectures. (a) Overview of Instruct-CLIP (I-CLIP), which embeds the visual change between the original and edited images $I^o$ and $I^e$ and the edit instruction $p$ into the same feature space through a contrastive loss, $\mathcal{L}_\text{contrast}$ (Eq. \ref{eq:loss_contrastive}). To recover the instruction $p$ from its I-CLIP embedding $z^\text{txt}$, we adopt the approach of DeCap \cite{li2023decap} and decode $z^\text{txt}$ back to $p$ using a cross-entropy loss, $\mathcal{L}_\text{DeCap}$ (Eq. \ref{eq:loss_decap}). At inference time, the text decoder takes the embedded visual change from the original to the edited image ($z^\text{vis}$) and decodes it to produce a new instruction. Because of the significant cosine similarity gap between $z^\text{vis}$ and $z^\text{txt}$ even when they are well aligned, directly decoding $z^\text{vis}$ leads to suboptimal results. To obtain a representation of $z^\text{vis}$ closer to the text features that the instruction decoder learned during training, we compute $(z^\text{vis})'$ with Eq. \ref{eq:decap_inference} and decode it to obtain the refined instruction $p'$, which is used to improve the dataset. (b) The architecture of the image encoder $\text{I-CLIP}_\text{vis}$ includes two weight-sharing $\text{DINOv2}$ \cite{oquab2023dinov2} modules in front of a standard $\text{CLIP}_\text{vis}$ encoder.
  • Figure 4: Training our $\text{LD-DINOv2}$ model. To use I-CLIP as part of the training objective for Stable Diffusion \cite{rombach2022high}, it needs to handle noisy latent images. Therefore, we replace the original $\text{DINOv2}$ backbone in Fig. \ref{fig:instructclip_indepth} with a latent-diffusion version that we call $\text{LD-DINOv2}$, which takes as input both the noisy latent image $\tilde{L}_k$ from the SD VAE encoding and the forward-diffusion (FD) timestep $t_k$. We then train $\text{LD-DINOv2}$ to "ignore" the noise and the latent-space compression and to extract the original $\text{DINOv2}$ features using the training objective $\mathcal{L}_\text{LD-DINOv2}$ (Eq. \ref{eq:loss_lddinov2}).
  • Figure 5: Comparison with state-of-the-art approaches for instruction-guided image editing, including HIVE \cite{zhang2023hive}, Inst-Inpaint (I-Inp) \cite{yildirim2023inst}, Watch Your Steps (WYS) \cite{mirzaei2025watch}, ZONE \cite{li2024zone}, MagicBrush (MagBr) \cite{zhang2024magicbrush}, and InstructPix2Pix (IP2P) \cite{brooks2023instructpix2pix}, showcasing the strength of our approach. The CLIP-T value of each output is shown in its top-left corner, with the best value per row underlined. Note that the image with the best CLIP-T score is not necessarily the visually best result, underscoring the deficiencies of conventional metrics (including CLIP-I and DINO-I, shown in the supplemental) for measuring the quality of image edits.
  • ...and 9 more figures
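
The captions for Figures 2 and 3 refer to a cosine similarity between a visual-change embedding and an instruction embedding, and to a contrastive loss that pulls matching pairs together. The sketch below illustrates both ideas in plain Python. It is an assumption-laden illustration, not the paper's Eq. for $\mathcal{L}_\text{contrast}$ (which is not reproduced on this page): it uses the familiar CLIP-style symmetric cross-entropy over a batch, the temperature value 0.07 is a placeholder, and the names `cosine_similarity` and `symmetric_contrastive_loss` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def symmetric_contrastive_loss(z_vis, z_txt, temperature=0.07):
    """CLIP-style symmetric cross-entropy over paired embeddings.

    z_vis[i] and z_txt[i] are assumed to describe the same edit; every other
    pairing in the batch is treated as a negative. The temperature is an
    assumed placeholder value.
    """
    n = len(z_vis)
    logits = [[cosine_similarity(v, t) / temperature for t in z_txt] for v in z_vis]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    columns = [list(c) for c in zip(*logits)]  # text-to-visual direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(columns))


# Hypothetical 2-D embeddings: the aligned batch pairs each visual change
# with its own instruction, the mismatched batch swaps the instructions.
visual_changes = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(visual_changes, [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = symmetric_contrastive_loss(visual_changes, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setting the aligned batch yields a loss near zero while the swapped batch yields a large one, which is the behavior a contrastive objective relies on: an instruction that does not describe the visual change scores a low cosine similarity and is penalized.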