The Curious Case of End Token: A Zero-Shot Disentangled Image Editing using CLIP

Hidir Yesiltepe, Yusuf Dalva, Pinar Yanardag

TL;DR

This work tackles disentangled attribute editing in diffusion-based image synthesis, where editing often entangles multiple regions. It reveals that the CLIP EOS token can serve as a zero-shot editing signal by swapping the EOS embedding of a target attribute into a source embedding, formalized as $\sigma(s, g) = [\, s_{\langle SOS \rangle : N} \mid w \cdot g_{\langle EOS \rangle} \,]$, enabling training-free edits. Compared with state-of-the-art methods such as SEGA, Ledits++, and Cycle Diffusion, the EOS-based approach yields competitive edit quality and disentanglement, with NSFW moderation demonstrated and a mean-opinion-score study supporting effectiveness. The method is lightweight and widely applicable to image and potentially video editing, offering a practical pathway for rapid, attribute-specific manipulation in diffusion models.
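The embedding swap summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `eos_swap`, the explicit EOS index arguments, and the toy two-dimensional embeddings are all hypothetical. It also assumes that only the single EOS row of the source is replaced and that any rows after it (for example padding) are left unchanged, which the summary does not specify.

```python
from typing import List

# One row per token position; each row is that token's embedding vector.
Embedding = List[List[float]]


def eos_swap(source: Embedding, target: Embedding,
             source_eos: int, target_eos: int, w: float = 1.0) -> Embedding:
    """Return a copy of `source` whose <EOS> row is w times the <EOS> row of `target`.

    Rows before `source_eos` (<SOS> and the prompt tokens) are kept as-is.
    Assumption: rows after `source_eos` are also kept unchanged.
    """
    if not 0 < source_eos < len(source):
        raise ValueError("source_eos must index a row after <SOS>")
    if not 0 <= target_eos < len(target):
        raise ValueError("target_eos is out of range")
    if len(source[source_eos]) != len(target[target_eos]):
        raise ValueError("source and target embedding widths differ")

    # Scale the target <EOS> embedding by the guidance weight w.
    scaled_eos = [w * x for x in target[target_eos]]

    # Copy the source so the caller's embedding is not mutated.
    edited = [row[:] for row in source]
    edited[source_eos] = scaled_eos
    return edited


# Toy example with 2-dimensional embeddings (values are made up).
s = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]               # <SOS>, "woman", <EOS>
g = [[0.0, 0.0], [5.0, 6.0], [7.0, 8.0], [0.5, -1.0]]  # <SOS>, two tokens, <EOS>
edited = eos_swap(s, g, source_eos=2, target_eos=3, w=2.0)
```

In a real pipeline the rows would come from the CLIP text encoder and the edited embedding would be passed to the diffusion model as its text conditioning; the weight `w` plays the role of the target EOS guidance scale.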

Abstract

Diffusion models have become prominent in creating high-quality images. However, unlike GAN models, which are celebrated for their ability to edit images in a disentangled manner, diffusion-based text-to-image models struggle to achieve the same level of precise attribute manipulation without compromising image coherence. In this paper, we show that CLIP, which is often used in popular text-to-image diffusion models such as Stable Diffusion, is capable of performing disentangled editing in a zero-shot manner. Through both qualitative and quantitative comparisons with state-of-the-art editing methods, we show that our approach yields competitive results. This insight may open opportunities for applying this method to various tasks, including image and video editing, providing a lightweight and efficient approach for disentangled editing.

Paper Structure (12 sections, 1 equation, 6 figures, 1 table)

Figures (6)

  • Figure 1: Disentangled image editing using the <EOS> token. Original images are displayed on the left side, and the edited versions are on the right. All edits are conducted using the <EOS> token related to the respective attribute, such as the <EOS> token of 'A woman with eyeglasses' or 'A man with mustache'.
  • Figure 2: Given a source embedding $s$ of a text prompt such as 'A woman' and a target embedding $g$ of a prompt such as 'A person with an eyeglass', we would like to modify the source embedding $s$ according to $g$ to reflect the corresponding change, by replacing the <EOS> token of $s$ with the <EOS> token of $g$.
  • Figure 3: Target <EOS> Guidance Scale Ablation. We investigate the trade-off between editing quality vs. preservation depending on the target <EOS> guidance scale hyperparameter.
  • Figure 4: Qualitative comparison. We compare the image editing capabilities of the <EOS> token with the state-of-the-art methods SEGA [brack2023sega], Ledits++ [brack2023ledits++], and Cycle Diffusion [cyclediffusion].
  • Figure 5: Qualitative comparison. (A) We compared our method with a prompt concatenation baseline. We used "a nurse" as the source prompt, and "man, glasses" as the target prompt for generating <EOS> guided image editing. The prompt for the baseline is "a nurse, man, glasses". All images were generated with identical initial noise. High-quality generation was facilitated by Realistic Vision V6. (B) <EOS>-based editing can be used for content moderation. (C) <EOS>-based editing is an effective technique for background editing as well.
  • ...and 1 more figure