TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing

Sherry X. Chen, Yaron Vaxman, Elad Ben Baruch, David Asulin, Aviad Moreshet, Kuo-Chin Lien, Misha Sra, Pradeep Sen

TL;DR

TiNO-Edit tackles the challenge of reliable diffusion-based image editing by optimizing diffusion timesteps and input noise, rather than relying solely on model fine-tuning or prompt manipulation. By operating in Stable Diffusion's latent space and employing LatentCLIP and LatentVGG-based losses, it achieves faster optimization and high-fidelity edits that respect both the original image and the target prompt. The method supports a range of editing styles, including text-guided, reference-guided, stroke-guided, and image composition, and remains compatible with DreamBooth and Textual Inversion concepts. Empirical results demonstrate superior qualitative and quantitative performance across diverse editing tasks, with strong ablations validating the importance of masking, timesteps, and latent-domain losses. This approach offers a practical, scalable workflow for controllable diffusion-based editing with broad applicability in creative and applied contexts.

Abstract

Despite many attempts to leverage pre-trained text-to-image (T2I) models like Stable Diffusion (SD) for controllable image editing, producing good, predictable results remains a challenge. Previous approaches have focused either on fine-tuning pre-trained T2I models on specific datasets to generate certain kinds of images (e.g., with a specific object or person), or on optimizing the weights, text prompts, and/or learned features for each input image in an attempt to coax the image generator into producing the desired result. However, these approaches all have shortcomings and fail to produce good results in a predictable and controllable manner. To address this problem, we present TiNO-Edit, an SD-based method that focuses on optimizing the noise patterns and diffusion timesteps during editing, something previously unexplored in the literature. With this simple change, we are able to generate results that both align better with the original images and reflect the desired result. Furthermore, we propose a set of new loss functions that operate in the latent domain of SD, greatly speeding up the optimization compared to prior approaches, which operate in the pixel domain. Our method can be easily applied to variations of SD, including Textual Inversion and DreamBooth, that encode new concepts and incorporate them into the edited results. We present a host of image-editing capabilities enabled by our approach. Our code is publicly available at https://github.com/SherryXTChen/TiNO-Edit.
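
The two quantities the abstract proposes to optimize, the noise pattern and the diffusion timestep, enter an Img2Img-style pipeline through the forward noising step, which blends the clean latent with Gaussian noise before denoising begins. The sketch below illustrates only that step in plain Python, using the standard DDPM forward process. It is not the authors' implementation: the function names `alpha_bar_schedule` and `noise_latent`, the scaled-linear schedule values, and the mapping from a normalized starting time `T` to a schedule index are all assumptions made for illustration.

```python
import math
import random


def alpha_bar_schedule(num_train_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative products of (1 - beta) for a scaled-linear beta schedule.

    The default values are assumptions, chosen to resemble a commonly used
    Stable Diffusion v1 scheduler configuration.
    """
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    alpha_bars, running = [], 1.0
    for i in range(num_train_steps):
        # Interpolate linearly in sqrt(beta) space, then square.
        beta = (lo + (hi - lo) * i / (num_train_steps - 1)) ** 2
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_latent(latent, noise, T, alpha_bars):
    """Noise a clean latent up to a normalized starting time T in [0, 1].

    T = 0 returns the latent unchanged; larger T keeps less of the original.
    The mapping from T to a schedule index is a hypothetical choice.
    """
    if not 0.0 <= T <= 1.0:
        raise ValueError("T must lie in [0, 1]")
    if T == 0.0:
        return list(latent)
    idx = max(0, round(T * len(alpha_bars)) - 1)
    a = alpha_bars[idx]
    # Standard forward process: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * N
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(latent, noise)]


if __name__ == "__main__":
    alpha_bars = alpha_bar_schedule()
    rng = random.Random(0)  # fixing the seed fixes the noise pattern N
    latent = [1.0, -0.5, 0.25, 2.0]  # stand-in for a flattened VAE latent
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    for T in (0.2, 0.5, 0.8):
        noisy = noise_latent(latent, noise, T, alpha_bars)
        print(T, [round(v, 3) for v in noisy])
```

Because the weight on the original latent shrinks as `T` grows while the weight on `noise` grows, both inputs shape the starting point of the denoising process, which is consistent with treating them as quantities to optimize rather than fixed hyperparameters.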

Paper Structure

This paper contains 19 sections, 8 equations, 18 figures, 5 tables, 1 algorithm.

Figures (18)

  • Figure 1: Overview of capabilities enabled by TiNO-Edit. TiNO-Edit offers various image-editing capabilities and can be run with DreamBooth (DB) [ruiz2023dreambooth] or Textual Inversion (TI) [gal2022image]. By optimizing diffusion timesteps and noise, it can generate realistic, high-quality outputs.
  • Figure 2: Effect of starting timestep and noise on image editing. Suppose we want to change the cat in the left image to a dog. We can input this image and the target prompt "a photo of a dog" to Stable Diffusion (SD) Img2Img [rombach2022high, sdimg2img], along with random Gaussian noise $N$ and a starting time $T \in [0,1]$, to produce results such as those shown in the grid on the right. Each column uses a fixed $T$ and each row uses a fixed $N$. As $T$ increases, the output matches the target prompt better, but it also diverges more from the original image in composition and pose. Furthermore, different random noise inputs can lead to different visual features.
  • Figure 3: Optimization parameters. We identify the optimization parameters by studying the SD denoising process. The output $\tilde{L}_0$ is affected only by the timesteps $t_k$ ($k \in [1,K]$) and the noisy latent image input $\tilde{L}_k$ at each of the $K$ denoising steps. Note that we assume the learned models are all fixed (denoted by the snowflake symbol) and that the number of timesteps $K$ is constant. $\tilde{L}_k$ can then be traced back through $K$ iterations to the initial latent image input $\tilde{L}_K$, which is computed from the starting timestep $t_K = T$ and the Gaussian noise $N$. Hence, we can achieve our goal by simply optimizing $N$ and the timesteps $t_k$ for all $k \in [1,K]$.
  • Figure 4: Training LatentCLIP. Our LatentCLIP visual encoder ($\text{LatentCLIP}_{\text{vis}}$) is a copy of a pre-trained CLIP image encoder ($\text{CLIP}_{\text{vis}}$) [radford2021learning], except that the first convolution layer is replaced so that it takes the latent vector $\text{VAE}_{\text{enc}}(I)$ [esser2021taming] as input and outputs the image feature $f_L$. The entire LatentCLIP visual encoder is unfrozen (indicated by the fire symbol, as opposed to the snowflake symbol, which means the model is frozen) and is trained to minimize the cosine difference between $f_L$ and $f$, the image feature of $I$ from the CLIP image encoder.
  • Figure 5: Compounded image editing. We present a compounded image-editing workflow in which our method is applied repeatedly to a single image. At each step, the user can perform any of the supported editing operations. Additional inputs, such as masks, reference images, user strokes, user-composed images, and the concept images used to train custom concepts, are shown next to the corresponding arrows.
  • ...and 13 more figures
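
The training objective described in the Figure 4 caption, minimizing the cosine difference between the latent-domain feature $f_L$ and the pixel-domain CLIP feature $f$, can be illustrated with a small plain-Python sketch. It assumes that "cosine difference" means one minus cosine similarity, which the caption does not spell out. The helper names `cosine_difference` and `batch_loss` are hypothetical, and a real training loop would compute this on tensors and backpropagate through $\text{LatentCLIP}_{\text{vis}}$ only, keeping $\text{CLIP}_{\text{vis}}$ frozen.

```python
import math


def cosine_difference(f_latent, f_pixel, eps=1e-12):
    """Return 1 - cos(f_latent, f_pixel) for two equal-length feature vectors."""
    if len(f_latent) != len(f_pixel):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(f_latent, f_pixel))
    norm = math.sqrt(sum(a * a for a in f_latent)) * math.sqrt(sum(b * b for b in f_pixel))
    # eps guards against division by zero for all-zero features.
    return 1.0 - dot / max(norm, eps)


def batch_loss(latent_feats, pixel_feats):
    """Mean cosine difference over a batch of (f_L, f) pairs."""
    pairs = list(zip(latent_feats, pixel_feats))
    return sum(cosine_difference(fl, f) for fl, f in pairs) / len(pairs)


if __name__ == "__main__":
    f_L = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2]]  # stand-ins for LatentCLIP features
    f = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # stand-ins for frozen CLIP features
    print(batch_loss(f_L, f))
```

Under this definition the loss depends only on the direction of the features, not their magnitude: it is 0 when $f_L$ points in the same direction as $f$ and grows toward 2 as they point in opposite directions.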