Unsupervised Region-Based Image Editing of Denoising Diffusion Models

Zixiang Li, Yue Song, Renshuai Tao, Xiaohong Jia, Yao Zhao, Wei Wang

TL;DR

This work tackles the challenge of discovering and controlling semantic attributes directly within the latent space of pre-trained diffusion models without supervision. It introduces Region-Based Editing (RBE), which leverages the Jacobian of the denoising network with respect to region-specific latent vectors and applies an orthogonal projection to confine edits to a target region using a coarse mask. By combining power iteration to approximate Jacobian directions and masked Jacobian refinements, RBE enables precise local attribute editing while preserving global image structure, achieving state-of-the-art results on multiple datasets and sometimes surpassing supervised methods. The approach broadens the practical impact of diffusion models by enabling unsupervised, region-aware editing with broad applicability across architectures.
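The power-iteration step mentioned above can be illustrated with a minimal sketch. It assumes the Jacobian is only available through two callables, `jvp` (returns $Jv$) and `vjp` (returns $J^\top u$); these names, the toy matrix `J`, and the iteration count are illustrative assumptions, not the paper's implementation. In practice the two products would come from automatic differentiation through the denoising network.

```python
import numpy as np


def top_singular_direction(jvp, vjp, dim, num_iters=100, seed=0):
    """Approximate the dominant right singular vector of a Jacobian J.

    jvp(v) must return J @ v and vjp(u) must return J.T @ u, so the
    Jacobian itself never has to be stored explicitly.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        w = vjp(jvp(v))              # one application of J^T J
        v = w / np.linalg.norm(w)    # renormalise to unit length
    sigma = np.linalg.norm(jvp(v))   # estimate of the top singular value
    return v, sigma


# Toy stand-in for a Jacobian (hypothetical, not a real denoiser):
# a 6x4 matrix built to have singular values 3, 1, 0.5, 0.1.
rng = np.random.default_rng(1)
Q_out, _ = np.linalg.qr(rng.standard_normal((6, 4)))
Q_in, _ = np.linalg.qr(rng.standard_normal((4, 4)))
J = Q_out @ np.diag([3.0, 1.0, 0.5, 0.1]) @ Q_in.T

v, sigma = top_singular_direction(lambda x: J @ x, lambda y: J.T @ y, dim=4)
```

On this toy matrix, `sigma` converges to the largest singular value and `v` aligns, up to sign, with the first column of `Q_in`. Further directions could be obtained by deflation or by iterating on a block of vectors with re-orthogonalisation.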

Abstract

Although diffusion models have achieved remarkable success in the field of image generation, their latent space remains under-explored. Current methods for identifying semantics within latent space often rely on external supervision, such as textual information and segmentation masks. In this paper, we propose a method to identify semantic attributes in the latent space of pre-trained diffusion models without any further training. By projecting the Jacobian of the targeted semantic region into a low-dimensional subspace which is orthogonal to the non-masked regions, our approach facilitates precise semantic discovery and control over local masked areas, eliminating the need for annotations. We conducted extensive experiments across multiple datasets and various architectures of diffusion models, achieving state-of-the-art performance. In particular, for some specific face attributes, the performance of our proposed method even surpasses that of supervised approaches, demonstrating its superior ability in editing local image properties.
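One way to read the projection described in the abstract is sketched below, under stated assumptions: `J_mask` and `J_rest` are hypothetical explicit Jacobians of the masked and non-masked image regions with respect to the latent, and the dominant right singular vectors of `J_rest` are taken as the subspace to avoid. The function name, the rank parameters, and the toy shapes are illustrative and are not taken from the paper.

```python
import numpy as np


def region_directions(J_mask, J_rest, k_rest, k_edit):
    """Find latent directions that move the masked region while staying
    orthogonal to the top-k_rest directions that move everything else."""
    # Dominant latent directions affecting the non-masked region.
    _, _, Vt_rest = np.linalg.svd(J_rest, full_matrices=False)
    V_rest = Vt_rest[:k_rest].T                      # (latent_dim, k_rest)
    # Orthogonal projector onto the complement of span(V_rest).
    P = np.eye(J_mask.shape[1]) - V_rest @ V_rest.T
    # Dominant directions of the masked-region Jacobian after projection.
    _, _, Vt = np.linalg.svd(J_mask @ P, full_matrices=False)
    return Vt[:k_edit].T                             # (latent_dim, k_edit)


# Toy example: 8-dimensional latent, rank-3 Jacobian for the non-masked region.
rng = np.random.default_rng(0)
J_rest = rng.standard_normal((10, 3)) @ rng.standard_normal((3, 8))
J_mask = rng.standard_normal((5, 8))

V = region_directions(J_mask, J_rest, k_rest=3, k_edit=2)
leak = np.abs(J_rest @ V).max()               # first-order change outside the mask
effect = np.linalg.norm(J_mask @ V, axis=0)   # first-order change inside the mask
```

In this toy setting `leak` is zero up to floating-point error because `J_rest` has exactly rank 3; with a real network the non-masked Jacobian is only approximately low-rank, so the suppression would be approximate and would depend on the choice of `k_rest`.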

Paper Structure

This paper contains 14 sections, 14 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Motivation for our proposed method, using mouth editing as an example. Direct editing often causes significant changes to areas beyond the mouth. Projecting the Jacobian matrix suppresses these unwanted changes and allows more precise editing.
  • Figure 2: Overview of our semantic discovery and editing method. First, we define the mask $M$ and the function $f$, both described in our method section, and use DDIM inversion to precompute $x_t$ and $h_t$ for later use. We then use the power iteration method to compute the Jacobian matrix $J_t$ and use $J_t$ to compute the required $V_t$. Finally, we set the timesteps to modify, the edit intensity, and other parameters, and use DDIM to generate images. All algorithms and specific experimental settings can be found in the appendix.
  • Figure 3: Qualitative results of our method. We experimented with the pre-trained DDPM model on the CelebA-HQ dataset at a resolution of $256 \times 256$. The leftmost image is the original. The green box marks the area covered by the mask, and the masked area is the same within each column. Note that the mask is used only during training and is not needed at test time. In our experiments, a mask over the same area can yield different attributes. For the mouth, for example, it can produce a smile or a mouth slanted to one side.
  • Figure 4: More semantic editing results
  • Figure 5: Qualitative results. Our method has the best results on the overall structure and details.
  • ...and 2 more figures
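
The caption of Figure 2 refers to DDIM inversion and DDIM generation. The standard deterministic DDIM update can be sketched as below; the noise prediction `eps` is a fixed random array standing in for the denoising network, and the $\bar{\alpha}$ values are hypothetical, so this is an illustration of the update rule only and not the authors' pipeline.

```python
import numpy as np


def ddim_step(x, eps, alpha_bar_from, alpha_bar_to):
    """Deterministic DDIM update between two noise levels.

    alpha_bar_* are cumulative products of the noise schedule. Moving to a
    smaller alpha_bar adds noise (inversion); moving to a larger one
    removes it (generation).
    """
    # Clean-image estimate implied by the current sample and predicted noise.
    x0_pred = (x - np.sqrt(1.0 - alpha_bar_from) * eps) / np.sqrt(alpha_bar_from)
    # Re-noise that estimate to the target noise level with the same eps.
    return np.sqrt(alpha_bar_to) * x0_pred + np.sqrt(1.0 - alpha_bar_to) * eps


rng = np.random.default_rng(0)
x_start = rng.standard_normal((3, 8, 8))
eps = rng.standard_normal((3, 8, 8))           # stand-in for the network's prediction

x_noisy = ddim_step(x_start, eps, 0.9, 0.5)    # inversion direction
x_back = ddim_step(x_noisy, eps, 0.5, 0.9)     # generation direction
```

With the same `eps` in both directions the round trip is exact. With a real network the prediction changes with the input and timestep, so DDIM inversion is only approximately reversible.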