Enabling Local Editing in Diffusion Models by Joint and Individual Component Analysis

Theodoros Kouzelis, Manos Plitsis, Mihalis A. Nicolaou, Yannis Panagakis

TL;DR

This paper addresses local image manipulation in diffusion models (DMs) by introducing an unsupervised method that factorizes the latent semantics learned by the denoising network of pre-trained DMs. The discovered directions can be applied to different images, which makes the method suitable for practical applications.

Abstract

Recent advances in Diffusion Models (DMs) have led to significant progress in visual synthesis and editing tasks, establishing them as strong competitors to Generative Adversarial Networks (GANs). However, the latent space of DMs is not as well understood as that of GANs. Recent research has focused on unsupervised semantic discovery in the latent space of DMs by leveraging the bottleneck layer of the denoising network, which has been shown to exhibit properties of a semantic latent space. However, these approaches are limited to discovering global attributes. In this paper, we address the challenge of local image manipulation in DMs and introduce an unsupervised method to factorize the latent semantics learned by the denoising network of pre-trained DMs. Given an arbitrary image and defined regions of interest, we utilize the Jacobian of the denoising network to establish a relation between the regions of interest and their corresponding subspaces in the latent space. Furthermore, we disentangle the joint and individual components of these subspaces to identify latent directions that enable local image manipulation. Once discovered, these directions can be applied to different images to produce semantically consistent edits, making our method suitable for practical applications. Experimental results on various datasets demonstrate that our method can produce semantic edits that are more localized and have better fidelity than the state of the art.
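
The idea of relating image regions to latent subspaces through a Jacobian, and then separating shared from region-specific directions, can be illustrated with a small NumPy sketch. This is not the authors' algorithm: it assumes a toy linear map `W` in place of the denoising network (so the Jacobian of each region is simply the matching rows of `W`), and it uses principal angles between subspaces as a simple stand-in for the joint and individual decomposition. The names `row_space`, `joint_and_individual`, `W`, `region_a`, and `region_b` are hypothetical.

```python
import numpy as np

def row_space(jacobian, rank):
    """Orthonormal basis (as rows) for the dominant row space of a Jacobian."""
    _, _, vt = np.linalg.svd(jacobian, full_matrices=False)
    return vt[:rank]

def joint_and_individual(v_a, v_b, tol=1e-6):
    """Split two orthonormal row bases into a joint part and two individual parts."""
    # Singular values of v_a @ v_b.T are the cosines of the principal angles
    # between the two subspaces; a cosine of 1 marks a shared direction.
    u, cosines, _ = np.linalg.svd(v_a @ v_b.T, full_matrices=False)
    v_joint = u[:, cosines > 1 - tol].T @ v_a

    def individual(v):
        # Project the joint part out, then re-orthonormalize what is left.
        residual = v - (v @ v_joint.T) @ v_joint
        _, s, vt = np.linalg.svd(residual, full_matrices=False)
        return vt[s > tol]

    return v_joint, individual(v_a), individual(v_b)

# Toy setup (assumption): 10 "pixels", 6 latent dimensions.
# Region a depends on latent axes 0, 1, 2; region b on axes 0, 3, 4.
# Axis 0 is therefore shared by both regions.
rng = np.random.default_rng(0)
latent_dim = 6
basis = np.eye(latent_dim)
region_a = slice(0, 5)
region_b = slice(5, 10)
W = np.vstack([
    rng.normal(size=(5, 3)) @ basis[[0, 1, 2]],
    rng.normal(size=(5, 3)) @ basis[[0, 3, 4]],
])

# For a linear map, the Jacobian of a region is the matching block of rows.
v_a = row_space(W[region_a], rank=3)
v_b = row_space(W[region_b], rank=3)
v_joint, v_ind_a, v_ind_b = joint_and_individual(v_a, v_b)

# Edit a latent code along an individual direction and along the joint one.
h = rng.normal(size=latent_dim)
local_change = W @ (h + 2.0 * v_ind_a[0]) - W @ h
global_change = W @ (h + 2.0 * v_joint[0]) - W @ h

print("change in region b after individual-a edit:", np.abs(local_change[region_b]).max())
print("change in region b after joint edit:", np.abs(global_change[region_b]).max())
```

In this toy setting, moving along a direction from the individual subspace of region a leaves region b unchanged, while moving along the joint direction changes both regions. A real denoising network is nonlinear, so its Jacobian would have to be computed at a given input (for example with automatic differentiation), and the exact-intersection test used here would likely need a looser tolerance or a different decomposition.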

Paper Structure (21 sections, 6 equations, 10 figures, 1 table, 3 algorithms)

Figures (10)

  • Figure 1: Local Editing with our method: Given regions of interest, we can identify latent directions that result in diverse semantic edits without affecting the rest of the image. Linear interpolation within the identified semantic directions leads to gradual changes in the generated image, such as opening and closing the eyes.
  • Figure 2: An overview of our method. Left: The regions of interest are selected. In this example, region $a$ and region $b$ correspond to the eyes and the mouth respectively. Center: The row space of the Jacobian of each region, $\mathbf{V}^a$ and $\mathbf{V}^b$, is decomposed into the joint subspace $\mathbf{V}_C$ and the individual subspaces $\mathbf{V}_A^a$, $\mathbf{V}_A^b$. Right: Editing in $\mathcal{H}$ with directions from the joint subspace results in global edits, whereas editing with directions from the individual subspaces results in localized edits.
  • Figure 3: Editing with the joint and individual components for the CelebA-HQ, LSUN-Churches and MetFaces datasets. Regions of interest are denoted by a pink rectangle. By decomposing the Jacobians of each region into a joint and an individual component, we can disentangle the global and the local semantic variation.
  • Figure 4: Local editing results on the CelebA-HQ (top), MetFaces and LSUN-Churches (bottom). The region of interest is highlighted with pink rectangles. Our method can identify diverse semantic manipulations within a region while not affecting the rest of the image. Note that the latent vectors used to edit the images in each row are derived from the image in the first row.
  • Figure 5: Qualitative comparison between our method and existing alternatives for two local edits.
  • ...and 5 more figures