Test-time Conditional Text-to-Image Synthesis Using Diffusion Models

Tripti Shukla, Srikrishna Karanam, Balaji Vasan Srinivasan

TL;DR

TINTIN is a training-free, test-time-only algorithm that conditions text-to-image diffusion model outputs on factors such as color palettes and edge maps. The authors describe it as the first approach to control model outputs with input color palettes, realized through a novel color distribution matching loss.

Abstract

We consider the problem of conditional text-to-image synthesis with diffusion models. Most recent works either finetune specific parts of the base diffusion model or introduce new trainable parameters, and this need for training makes deployment inflexible. To address this gap in the current literature, we propose TINTIN: Test-time Conditional Text-to-Image Synthesis using Diffusion Models, a new training-free, test-time-only algorithm that conditions text-to-image diffusion model outputs on factors such as color palettes and edge maps. In particular, we propose to interpret noise predictions during denoising as gradients of an energy-based model, leading to a flexible approach that manipulates the noise by matching predictions inferred from them to the ground-truth conditioning input. This results in, to the best of our knowledge, the first approach to control model outputs with input color palettes, which we realize using a novel color distribution matching loss. We also show that this test-time noise manipulation extends easily to other types of conditioning, e.g., edge maps. We conduct extensive experiments using a variety of text prompts, color palettes, and edge maps and demonstrate significant improvement over the current state of the art, both qualitatively and quantitatively.
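The idea of treating noise predictions as energy gradients can be illustrated with a toy sketch. The code below is not the authors' algorithm: it assumes the standard DDPM estimate of the clean sample, uses a hypothetical squared-distance energy as a stand-in for a conditioning loss, and applies a classifier-guidance-style shift to the noise. The names `predict_x0`, `energy`, `guided_eps`, and the `scale` parameter are illustrative assumptions.

```python
import math

def predict_x0(x_t, eps, alpha_bar):
    """Estimate the clean sample from a noisy one (standard DDPM identity)."""
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(x_t, eps)]

def energy(x0_hat, target):
    """Hypothetical energy: half the squared distance to a conditioning target."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x0_hat, target))

def guided_eps(x_t, eps, target, alpha_bar, scale):
    """Shift the noise prediction along the energy gradient w.r.t. x_t.

    The gradient is analytic here because eps is treated as a constant;
    with a real network it would come from automatic differentiation.
    """
    x0_hat = predict_x0(x_t, eps, alpha_bar)
    grad = [(a - b) / math.sqrt(alpha_bar) for a, b in zip(x0_hat, target)]
    coeff = scale * math.sqrt(1.0 - alpha_bar)
    return [e + coeff * g for e, g in zip(eps, grad)]

# Toy three-dimensional "latent" (all values are made up for illustration).
x_t = [0.9, -0.4, 0.2]
eps = [0.3, -0.1, 0.5]
target = [0.5, 0.5, 0.5]
alpha_bar, scale = 0.6, 0.5

before = energy(predict_x0(x_t, eps, alpha_bar), target)
new_eps = guided_eps(x_t, eps, target, alpha_bar, scale)
after = energy(predict_x0(x_t, new_eps, alpha_bar), target)
print(before, after)  # the energy should drop after the noise is adjusted
```

In this toy setting the adjusted noise moves the implied clean estimate toward the target, so the energy decreases whenever `scale * (1 - alpha_bar) / alpha_bar` is below 2. A real pipeline would replace the squared-distance energy with a loss computed on the decoded image.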

Paper Structure

This paper contains 11 sections, 12 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Comparison of TINTIN with FreeDoM [yu2023freedom] for color and edge conditions.
  • Figure 2: The overall architecture of our training-free approach, TINTIN. The reference $R$ is a list of hex values for color control and a reference image for edge control, $x_t$ is the input latent code sampled from a random Gaussian distribution, $x_{t-1}$ is the latent code at timestep $t-1$, and $I_{t-1}$ is the denoised RGB image. The Condition Generator network projects the condition control and the denoised image into the same space, denoted by $I_R$ and $I'_{t-1}$ respectively.
  • Figure 3: Demonstration of the amplified effect of applying conditional control in specific time intervals for color and edge conditioning. Color conditioning takes effect in the middle stage of the sampling process, whereas edge conditioning takes effect in the early stage. In the second row, the position and structure of the cat change rapidly to fit the reference edge map.
  • Figure 4: (a) We illustrate the ability of TINTIN to generate color-palette-conditioned images against trainable methods like T2I-Adapter and ControlNet and training-free methods like FreeDoM. (b) Ablation analysis of our method (TINTIN) for color conditioning. (c) Ablation analysis of our method (TINTIN) for edge conditioning.
  • Figure 5: We compare TINTIN's edge-map-conditioned image generation against trainable methods (T2I-Adapter and ControlNet) and training-free methods (MasaCtrl and FreeDoM). TINTIN exhibits superior diversity in image generation while closely following the structure of the reference edge map, outperforming the other methods.
  • ...and 2 more figures
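
The Figure 2 caption states that the color reference is a list of hex values. As a rough illustration of what comparing an image's colors to such a palette could look like, the sketch below parses hex codes and measures how far the image's palette usage is from a target distribution. This is a hypothetical stand-in, not the paper's color distribution matching loss: the nearest-color assignment, the uniform default target, the L1 distance, and all function names are assumptions. The hard assignment is also not differentiable, so a guidance loss would need a soft variant.

```python
def hex_to_rgb(code):
    """Convert '#RRGGBB' to an (r, g, b) tuple of ints in 0..255."""
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))

def nearest_index(pixel, palette):
    """Index of the palette color closest to the pixel (squared RGB distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(pixel, color)) for color in palette]
    return dists.index(min(dists))

def palette_histogram(pixels, palette):
    """Fraction of pixels assigned to each palette color."""
    counts = [0] * len(palette)
    for px in pixels:
        counts[nearest_index(px, palette)] += 1
    return [c / len(pixels) for c in counts]

def distribution_distance(pixels, palette, target=None):
    """L1 distance between the pixel histogram and a target (uniform by default)."""
    hist = palette_histogram(pixels, palette)
    if target is None:
        target = [1.0 / len(palette)] * len(palette)
    return sum(abs(h - t) for h, t in zip(hist, target))

# Made-up palette and pixel lists for illustration only.
palette = [hex_to_rgb(h) for h in ["#FF0000", "#00FF00", "#0000FF"]]
balanced = [(250, 5, 5), (10, 240, 10), (0, 0, 255)]   # one pixel per palette color
skewed = [(250, 5, 5), (255, 0, 0), (200, 30, 30)]     # every pixel is reddish

print(distribution_distance(balanced, palette))
print(distribution_distance(skewed, palette))
```

The balanced pixel list uses each palette color equally and scores lower than the all-red list, which is the kind of signal a palette-matching objective would need to provide.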