IterInv: Iterative Inversion for Pixel-Level T2I Models

Chuanming Tang, Kai Wang, Joost van de Weijer

TL;DR

The paper tackles the challenge of reconstructing and editing real images with pixel-level T2I diffusion models, where standard DDIM inversion fails in cascaded pipelines. It introduces IterInv, an iterative inversion framework that uses NTI (Null-Text Inversion) for the low-resolution stage and inner-iteration optimization for the higher-resolution stages to recover the original image with high fidelity. Evaluated on the DeepFloyd-IF pipeline, IterInv delivers higher inversion accuracy than DDIM and competitive results against latent-space baselines, while enabling pixel-level editing through DiffEdit. The work demonstrates a practical route to reliable, controllable image editing in pixel-space diffusion; the code is available at https://github.com/Tchuanm/IterInv.git.

Abstract

Large-scale text-to-image diffusion models have been a ground-breaking development in generating convincing images following an input text prompt. The goal of image editing research is to give users control over the generated images by modifying the text prompt. Current image editing techniques predominantly rely on DDIM inversion, a common practice rooted in Latent Diffusion Models (LDM). However, large pretrained T2I models that work in the latent space lose details because of the initial compression stage with an autoencoder. Other mainstream T2I pipelines that work at the pixel level, such as Imagen and DeepFloyd-IF, circumvent this problem. They are commonly composed of multiple stages, typically starting with a text-to-image stage followed by several super-resolution stages. In such a pipeline, DDIM inversion fails to find the initial noise and regenerate the original image, because the super-resolution diffusion models are not compatible with the DDIM technique. According to our experimental findings, iteratively concatenating the noisy image as the condition is the root of this problem. Based on this observation, we develop an iterative inversion (IterInv) technique for this category of T2I models and verify IterInv with the open-source DeepFloyd-IF model. Specifically, IterInv employs NTI for the inversion and reconstruction of the low-resolution image generation stage. In stages 2 and 3, we update the latent variance at each timestep to find the deterministic inversion trace and promote the reconstruction process. By combining our method with a popular image editing method, we demonstrate the application prospects of IterInv. The code is available at https://github.com/Tchuanm/IterInv.git.
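
To make the inversion problem in the abstract concrete, the following is a minimal sketch of deterministic DDIM inversion and reconstruction on a toy example. It is not the IterInv algorithm and does not use DeepFloyd-IF: the noise schedule, the `toy_eps` noise predictor, and all function names are assumptions made for illustration. Plain inversion reuses the noise predicted at the current step, so the round trip is only approximate; the optional inner loop re-evaluates the noise at the candidate next state, which is one generic way an iterative refinement per timestep can tighten the inversion trace.

```python
import numpy as np

def make_alpha_bars(num_steps=50):
    """Hypothetical schedule: alpha_bar[0] = 1 is the clean image, later entries are noisier."""
    betas = np.linspace(1e-4, 2e-2, num_steps)
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def toy_eps(x, t):
    """Stand-in for a trained noise predictor (an assumption, not a real model)."""
    return 0.5 * np.tanh(x)

def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM update between two noise levels."""
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def invert(x0, alpha_bars, eps_fn, inner_iters=0):
    """Map a clean image to its noisy endpoint, step by step."""
    x = x0
    for t in range(len(alpha_bars) - 1):
        a_t, a_next = alpha_bars[t], alpha_bars[t + 1]
        # Plain DDIM inversion: reuse the noise predicted at the current state.
        x_next = ddim_step(x, eps_fn(x, t), a_t, a_next)
        # Optional refinement: re-evaluate the noise at the candidate next state,
        # so that the reverse step from x_next lands back on x.
        for _ in range(inner_iters):
            x_next = ddim_step(x, eps_fn(x_next, t + 1), a_t, a_next)
        x = x_next
    return x

def reconstruct(x_T, alpha_bars, eps_fn):
    """Run the deterministic DDIM sampler from the noisy endpoint back to t = 0."""
    x = x_T
    for t in range(len(alpha_bars) - 1, 0, -1):
        x = ddim_step(x, eps_fn(x, t), alpha_bars[t], alpha_bars[t - 1])
    return x

def round_trip_error(x0, alpha_bars, eps_fn, inner_iters=0):
    """Largest absolute pixel difference after inverting and reconstructing."""
    x_T = invert(x0, alpha_bars, eps_fn, inner_iters)
    return float(np.max(np.abs(reconstruct(x_T, alpha_bars, eps_fn) - x0)))

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))  # toy "image"
alpha_bars = make_alpha_bars()
err_plain = round_trip_error(image, alpha_bars, toy_eps)
err_refined = round_trip_error(image, alpha_bars, toy_eps, inner_iters=10)
print(f"plain inversion error:   {err_plain:.2e}")
print(f"refined inversion error: {err_refined:.2e}")
```

In this toy setting the refined inversion reconstructs the input far more closely than the plain one. In a real cascaded pipeline the noise predictor of each super-resolution stage is additionally conditioned on a noised low-resolution image, which this sketch does not model.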

Paper Structure

This paper contains 10 sections, 6 equations, 2 figures, 1 table, and 1 algorithm.

Figures (2)

  • Figure 1: The DeepFloyd-IF pipeline and the proposed IterInv inversion technique.
  • Figure 2: Visual comparison of various inversion methods and our editing results.