Test-time Controllable Image Generation by Explicit Spatial Constraint Enforcement

Z. Zhang, B. Liu, J. Bao, L. Chen, S. Zhu, J. Yu

TL;DR

This paper tackles the challenge of test-time controllable text-to-image generation under natural prompts and complex spatial layouts without retraining. It decomposes spatial constraints into semantic and geometric components and enforces both via prompt completion, attention-map alignment, RoI-based latent relocation, and diffusion-based latent refill. The approach yields substantial gains on the COCO-Stuff dataset, including improvements in layout consistency (a 30% relative boost) and AP-based metrics compared to training-free baselines, while maintaining image quality. By enabling flexible, open-world layout control without labeled data or fine-tuning, the method significantly enhances practical controllability of diffusion-based generation in real-world applications.

Abstract

Recent text-to-image generation favors various forms of spatial conditions, e.g., masks, bounding boxes, and key points. However, the majority of the prior art requires form-specific annotations to fine-tune the original model, leading to poor test-time generalizability. Meanwhile, existing training-free methods work well only with simplified prompts and spatial conditions. In this work, we propose a novel yet generic test-time controllable generation method that targets natural text prompts and complex conditions. Specifically, we decouple spatial conditions into semantic and geometric conditions and then enforce consistency with each individually during the image-generation process. For the former, we aim to bridge the gap between the semantic condition and the text prompt, as well as the gap between that condition and the attention maps of the diffusion model. To achieve this, we first complete the prompt w.r.t. the semantic condition, and then remove the negative impact of distracting prompt words by measuring their statistics in attention maps as well as their distances in word space w.r.t. this condition. To further cope with complex geometric conditions, we introduce a geometric transform module, in which Regions of Interest (RoIs) are identified in attention maps and then used to translate category-wise latents w.r.t. the geometric condition. More importantly, we propose a diffusion-based latents-refill method to explicitly remove the impact of latents at the RoI, reducing artifacts in the generated images. Experiments on the COCO-Stuff dataset show a 30% relative boost over SOTA training-free methods on layout-consistency evaluation metrics.
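
To make the geometric step described above more concrete, here is a minimal NumPy sketch of RoI-based latent relocation. It is an illustration under stated assumptions, not the authors' implementation: the relative-threshold rule for finding the RoI, the nearest-neighbour resize, and the names `roi_from_attention`, `resize_nearest`, and `relocate_latents` are all hypothetical. The diffusion-based refill is not implemented; the sketch only returns a mask of the vacated positions that such a step would need to fill.

```python
import numpy as np


def roi_from_attention(attn, rel_threshold=0.5):
    """Bounding box (top, left, bottom, right) of attn >= rel_threshold * max.

    The thresholding rule is an assumption made for this sketch.
    """
    mask = attn >= rel_threshold * attn.max()
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(rows[0]), int(cols[0]), int(rows[-1]) + 1, int(cols[-1]) + 1


def resize_nearest(patch, out_h, out_w):
    """Nearest-neighbour resize of a (C, h, w) patch to (C, out_h, out_w)."""
    _, h, w = patch.shape
    row_idx = np.arange(out_h) * h // out_h
    col_idx = np.arange(out_w) * w // out_w
    return patch[:, row_idx[:, None], col_idx[None, :]]


def relocate_latents(latents, attn, target_box, rel_threshold=0.5):
    """Copy the latents under the attention RoI into target_box.

    latents: (C, H, W) array; attn: (H, W) attention map for one category;
    target_box: (top, left, bottom, right) taken from the layout.
    Returns the edited latents and a boolean (H, W) mask of vacated
    positions, i.e. the old RoI minus the target box.
    """
    top, left, bottom, right = roi_from_attention(attn, rel_threshold)
    t_top, t_left, t_bottom, t_right = target_box
    patch = latents[:, top:bottom, left:right]
    moved = resize_nearest(patch, t_bottom - t_top, t_right - t_left)

    out = latents.copy()
    out[:, t_top:t_bottom, t_left:t_right] = moved

    vacated = np.zeros(attn.shape, dtype=bool)
    vacated[top:bottom, left:right] = True
    vacated[t_top:t_bottom, t_left:t_right] = False
    return out, vacated
```

In this sketch the source region keeps its old values after the copy, which is exactly the leftover content the abstract says the refill step is meant to remove.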

Paper Structure

This paper contains 22 sections, 7 equations, 19 figures, 4 tables.

Figures (19)

  • Figure 1: By enforcing semantic and geometric consistency, we can generate images with more precisely-located objects w.r.t. given layouts.
  • Figure 2: Given text prompts and layouts, our method enforces semantic consistency by prompt editing and attention-map matching. Geometric consistency is then incorporated by identifying, relocating, and refilling the latents. Together, these designs lead to a more realistic and consistent image generation process.
  • Figure 3: Prompt editing enables the discovery of missing objects by comparing the semantics of the caption and the layout.
  • Figure 4: Instead of assigning attention maps according to their default semantic token, we propose to match attention maps by their statistics and their distance in word space.
  • Figure 5: To refill $b^t$ with natural values, we introduce a lightweight diffusion model to iteratively update it.
  • ...and 14 more figures
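
The attention-map matching described in the Figure 4 caption can be sketched as a simple scoring rule. This is only one hypothetical reading: the captions do not say which statistics or which distance the method uses, so the spatial standard deviation, the cosine similarity, the mixing `weight`, and the function name `match_token` are all assumptions made for illustration.

```python
import numpy as np


def match_token(attn_maps, token_embs, category_emb, weight=0.5):
    """Pick the prompt token whose attention map best stands in for a category.

    attn_maps: (T, H, W) cross-attention maps, one per prompt token.
    token_embs: (T, D) word embeddings of the prompt tokens.
    category_emb: (D,) word embedding of the layout category.
    Returns the index of the best token and the per-token scores.
    """
    # Assumed attention statistic: spatial standard deviation, scaled to [0, 1].
    # Flat maps score low, spatially concentrated maps score high.
    flat = attn_maps.reshape(attn_maps.shape[0], -1)
    spread = flat.std(axis=1)
    spread = spread / (spread.max() + 1e-8)

    # Assumed word-space closeness: cosine similarity to the category embedding.
    emb_norm = token_embs / (np.linalg.norm(token_embs, axis=1, keepdims=True) + 1e-8)
    cat_norm = category_emb / (np.linalg.norm(category_emb) + 1e-8)
    similarity = emb_norm @ cat_norm

    scores = weight * spread + (1.0 - weight) * similarity
    return int(np.argmax(scores)), scores
```

Combining both terms means a token with a sharp attention map but unrelated meaning, or a related word with diffuse attention, is ranked below a token that satisfies both criteria.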