Training-free Color-Style Disentanglement for Constrained Text-to-Image Synthesis

Aishwarya Agarwal, Srikrishna Karanam, Balaji Vasan Srinivasan

TL;DR

This work presents the first training-free, test-time-only method to disentangle color and style attributes of a reference image and condition text-to-image models on them. It rests on two key innovations: a feature transformation of the latent codes at inference time that makes the covariance matrix of the current generation follow that of the reference image (for color), and a transformation of the self-attention feature maps with respect to those computed from the L channel of the reference in LAB space (for style).

Abstract

We consider the problem of independently controlling, in a disentangled fashion, the outputs of text-to-image diffusion models with the color and style attributes of a user-supplied reference image. We present the first training-free, test-time-only method to disentangle and condition text-to-image models on color and style attributes from a reference image. To realize this, we propose two key innovations. Our first contribution is to transform the latent codes at inference time using feature transformations that make the covariance matrix of the current generation follow that of the reference image, helping meaningfully transfer color. Next, we observe that there exists a natural disentanglement between color and style in the LAB image space, which we exploit to transform the self-attention feature maps of the image being generated with respect to those of the reference computed from its L channel. Both these operations happen purely at test time and can be done independently or merged. This results in a flexible method where color and style information can come from the same reference image or two different sources, and a new generation can seamlessly fuse them in either scenario.
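
To make the covariance-matching idea concrete, the following is a minimal sketch of a standard whitening-coloring transform in NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function name `match_covariance`, the `(channels, positions)` feature layout, and the `eps` regularizer are all hypothetical, and the abstract does not say at which denoising steps, or with what masking, such a transform would be applied.

```python
import numpy as np


def match_covariance(content, reference, eps=1e-5):
    """Whitening-coloring sketch (hypothetical helper, not the paper's code).

    content, reference: arrays of shape (C, N), i.e. C channels by N
    spatial positions, for example flattened latent codes.
    Returns content features whose channel mean and covariance
    approximately follow those of the reference.
    """
    c_mean = content.mean(axis=1, keepdims=True)
    r_mean = reference.mean(axis=1, keepdims=True)
    c = content - c_mean
    r = reference - r_mean

    # Channel covariance matrices, lightly regularized for stability.
    eye = np.eye(content.shape[0])
    c_cov = c @ c.T / (c.shape[1] - 1) + eps * eye
    r_cov = r @ r.T / (r.shape[1] - 1) + eps * eye

    # Whitening: multiply by the inverse square root of the content covariance.
    c_vals, c_vecs = np.linalg.eigh(c_cov)
    whiten = c_vecs @ np.diag(c_vals ** -0.5) @ c_vecs.T

    # Coloring: multiply by the square root of the reference covariance.
    r_vals, r_vecs = np.linalg.eigh(r_cov)
    color = r_vecs @ np.diag(r_vals ** 0.5) @ r_vecs.T

    return color @ whiten @ c + r_mean


# Toy usage with random stand-ins for 4-channel latents.
rng = np.random.default_rng(0)
content = rng.normal(size=(4, 2000))
reference = rng.normal(size=(4, 4)) @ rng.normal(size=(4, 2000)) + 3.0
out = match_covariance(content, reference)
```

After the call, `np.cov(out)` is close to `np.cov(reference)` up to the small `eps` regularization, which is the property the abstract describes for the latent codes.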

Paper Structure

This paper contains 11 sections, 3 equations, 16 figures, and 2 tables.

Figures (16)

  • Figure 1: Attribute entanglement in prior appearance transfer work [alaluf2023cross].
  • Figure 2: Attribute disentanglement in LAB space.
  • Figure 3: A visual illustration of the proposed method. (a) Obtain inverted latents for the style and color reference images. (b) Sample a new latent given a prompt (e.g., a bird above) and begin the denoising process for generating new images. (c) During this ongoing denoising process, perform self-attention KV injection for style transfer and/or masked latent recoloring for color transfer. (d) In step (c), use the intermediates obtained while reconstructing the style and color reference images from their respective inverted latents.
  • Figure 4: Intermediate decoded latents demonstrate the progression of color and fine-grained style information across denoising timesteps.
  • Figure 5: Progression of recolorised decoded latents across denoising timesteps.
  • ...and 11 more figures