GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation

Phillip Y. Lee, Taehoon Yoon, Minhyuk Sung

TL;DR

GrounDiT, a novel training-free spatial grounding technique for text-to-image generation using Diffusion Transformers, demonstrates that DiT can generate noisy patches corresponding to each bounding box, fully encoding the target object and allowing for fine-grained control over each region.

Abstract

We introduce GrounDiT, a novel training-free spatial grounding technique for text-to-image generation using Diffusion Transformers (DiT). Spatial grounding with bounding boxes has gained attention for its simplicity and versatility, allowing for enhanced user control in image generation. However, prior training-free approaches often rely on updating the noisy image during the reverse diffusion process via backpropagation from custom loss functions, which frequently struggle to provide precise control over individual bounding boxes. In this work, we leverage the flexibility of the Transformer architecture, demonstrating that DiT can generate noisy patches corresponding to each bounding box, fully encoding the target object and allowing for fine-grained control over each region. Our approach builds on an intriguing property of DiT, which we refer to as semantic sharing. Due to semantic sharing, when a smaller patch is jointly denoised alongside a generatable-size image, the two become semantic clones. Each patch is denoised in its own branch of the generation process and then transplanted into the corresponding region of the original noisy image at each timestep, resulting in robust spatial grounding for each bounding box. In our experiments on the HRS and DrawBench benchmarks, we achieve state-of-the-art performance compared to previous training-free approaches.
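The transplantation step described in the abstract can be pictured as a region copy: the noisy patch from a box's own branch overwrites that box's region of the noisy image. The following is only a minimal sketch under stated assumptions, not the authors' implementation. The function name `transplant_patch`, the `(top, left, height, width)` box convention, and the channel-first array layout are all hypothetical choices made for illustration.

```python
import numpy as np

def transplant_patch(noisy_image, noisy_patch, box):
    """Copy a noisy patch into the bounding-box region of a noisy image.

    noisy_image: array of shape (C, H, W)
    noisy_patch: array of shape (C, h, w), denoised in its own branch
    box: (top, left, height, width), a hypothetical convention
    """
    top, left, h, w = box
    if noisy_patch.shape[-2:] != (h, w):
        raise ValueError("patch size must match the bounding box")
    out = noisy_image.copy()  # leave the input array untouched
    out[..., top:top + h, left:left + w] = noisy_patch
    return out

# Toy example: a 4-channel 8x8 noisy image and a 3x2 patch for one box.
image = np.zeros((4, 8, 8))
patch = np.ones((4, 3, 2))
result = transplant_patch(image, patch, box=(1, 2, 3, 2))
```

In the method as described, this copy would happen at each timestep, once per bounding box. The sketch shows a single box at a single step.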

Paper Structure

This paper contains 38 sections, 10 equations, 9 figures, 6 tables, 2 algorithms.

Figures (9)

  • Figure 1: Spatially grounded images generated by our GrounDiT. Each image is generated based on a text prompt along with bounding boxes, which are displayed in the upper right corner of each image. Compared to existing methods that often struggle to accurately place objects within their designated bounding boxes, our GrounDiT enables more precise spatial control through a novel noisy patch transplantation mechanism.
  • Figure 2: A single denoising step in GrounDiT consists of two stages. The Global Update (Sec. \ref{subsec:method_stage_1}) establishes coarse spatial grounding by updating the noisy image with a custom loss function. Then, the Local Update (Sec. \ref{subsec:method_stage_2}) provides fine-grained spatial control over individual bounding boxes through a novel technique called noisy patch transplantation.
  • Figure 3: (A) Joint Denoising. Two different noisy images, $\mathbf{x}_t$ and $\mathbf{y}_t$, are each assigned positional embeddings based on their respective sizes. The two sets of image tokens are then merged and passed through DiT for a denoising step. Afterward, the denoised tokens are split back into $\mathbf{x}_{t-1}$ and $\mathbf{y}_{t-1}$. (B), (C) Semantic Sharing. Denoising two noisy images using joint denoising results in semantically correlated content between the generated images. Here, $\gamma$ indicates that joint denoising is applied during the initial $100\gamma\%$ of the timesteps, after which the images are denoised independently for the remaining steps.
  • Figure 4: Qualitative comparisons between our GrounDiT and baselines. The leftmost column shows the input bounding boxes, and columns 2-6 show the baseline results. The rightmost column shows the results of our GrounDiT.
  • Figure 5: Spatially grounded images generated by our GrounDiT with varying aspect ratios and sizes. Each image is generated based on a text prompt along with bounding boxes, which are displayed next to (or below) each image.
  • ...and 4 more figures
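
The merge, denoise, split pattern in the Figure 3 caption can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's code. `toy_denoiser` is a hypothetical stand-in for a DiT step (it only mixes each token with the mean of all tokens), positional embeddings are assumed to have been added to the tokens already, and `uses_joint_denoising` is a hypothetical helper expressing the $\gamma$ schedule.

```python
import numpy as np

def toy_denoiser(tokens):
    """Hypothetical stand-in for one DiT step: every token is mixed with
    the mean of all tokens, so the two images influence each other."""
    return 0.5 * tokens + 0.5 * tokens.mean(axis=0, keepdims=True)

def joint_denoise_step(x_tokens, y_tokens, denoiser):
    """Merge two token sets, run one denoising step, and split them back."""
    n_x = x_tokens.shape[0]
    merged = np.concatenate([x_tokens, y_tokens], axis=0)
    denoised = denoiser(merged)
    return denoised[:n_x], denoised[n_x:]

def uses_joint_denoising(step_index, num_steps, gamma):
    """True while the step lies in the initial 100*gamma% of the timesteps."""
    return step_index < gamma * num_steps

# Toy example: 4 tokens for one image, 2 tokens for a smaller one.
x_t = np.zeros((4, 2))
y_t = np.ones((2, 2))
x_prev, y_prev = joint_denoise_step(x_t, y_t, toy_denoiser)
joint_steps = [i for i in range(10) if uses_joint_denoising(i, 10, 0.5)]
```

Because the denoiser sees the merged sequence, the output for each image depends on the other image's tokens. The sketch uses that dependence as a simplified picture of how joint denoising could produce correlated content.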