SpotActor: Training-Free Layout-Controlled Consistent Image Generation

Jiahao Wang, Caixia Yan, Weizhan Zhang, Haonan Lin, Mengmeng Wang, Guang Dai, Tieliang Gong, Hao Sun, Jingdong Wang

TL;DR

SpotActor is a training-free pipeline built on a new formalization of dual energy guidance with optimization in a dual semantic-latent space. It features a layout-conditioned optimizing stage and a consistent sampling stage.

Abstract

Text-to-image diffusion models significantly enhance the efficiency of artistic creation with high-fidelity image generation. However, in typical application scenarios like comic book production, they can neither place each subject into its expected spot nor maintain the consistent appearance of each subject across images. To address these issues, we pioneer a novel task, Layout-to-Consistent-Image (L2CI) generation, which produces consistent and compositional images in accordance with the given layout conditions and text prompts. To accomplish this challenging task, we present a new formalization of dual energy guidance with optimization in a dual semantic-latent space and thus propose a training-free pipeline, SpotActor, which features a layout-conditioned backward update stage and a consistent forward sampling stage. In the backward stage, we devise a nuanced layout energy function to mimic the attention activations with a sigmoid-like objective. In the forward stage, we design Regional Interconnection Self-Attention (RISA) and Semantic Fusion Cross-Attention (SFCA) mechanisms that allow mutual interactions across images. To evaluate the performance, we present ActorBench, a dedicated benchmark with hundreds of reasonable prompt-box pairs stemming from object detection datasets. Comprehensive experiments are conducted to demonstrate the effectiveness of our method. The results show that SpotActor fulfills the expectations of this task and showcases the potential for practical applications with superior layout alignment, subject consistency, prompt conformity and background diversity.

Paper Structure (37 sections, 17 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Given bounding boxes and text prompts of subject and plot descriptions, our method generates high-quality images where subjects align to the layout as well as share a consistent appearance.
  • Figure 2: The overall architecture of SpotActor. (a) Our method consists of two stages at each sample step in a dual energy guidance manner. The backward stage optimizes the latent codes and semantic embeddings with the nuanced layout energy based on the sigmoid-like objective. Subsequently, the forward sampling is enhanced by two intricate attention mechanisms: (b) RISA and (c) SFCA.
  • Figure 3: Illustration of the attention analysis. (a) IntraM is the attention map normalized within each token, while (b) InterM is normalized across all the tokens. We further visualize (c) 3D distributions of attention maps and propose (d) sigmoid-like approximate distributions.
  • Figure 4: The qualitative comparison between baselines and our SpotActor. Our method shows superior layout controllability compared to LayoutGuidance and exhibits better subject consistency compared to StoryDiffusion. The central subjects are marked in blue and the given boxes are outlined in blue lines.
  • Figure 5: Illustration of double subject generation by SpotActor. Our method maintains excellent performance when handling multiple subjects. Different central subjects are marked in different colors.
  • ...and 4 more figures
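
The Figure 3 caption distinguishes two ways of normalizing a cross-attention map: within each token (IntraM) and across all tokens (InterM). The sketch below illustrates that distinction on a toy score matrix. It is only an assumption-laden illustration: the use of softmax as the normalizer, the `[position][token]` layout, the function names, and the example scores are all hypothetical and are not taken from the paper.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def inter_token_map(scores):
    """Normalize across tokens: each spatial position's row sums to 1."""
    return [softmax(row) for row in scores]


def intra_token_map(scores):
    """Normalize within each token: each token's column sums to 1 over positions."""
    columns = [softmax(list(col)) for col in zip(*scores)]
    # Transpose back to the [position][token] layout.
    return [list(row) for row in zip(*columns)]


# Hypothetical raw scores: 4 spatial positions (rows) x 3 prompt tokens (columns).
scores = [
    [2.0, 0.1, 0.3],
    [1.5, 0.2, 0.1],
    [0.1, 1.8, 0.2],
    [0.0, 0.3, 0.4],
]

inter = inter_token_map(scores)  # assumed InterM-style normalization
intra = intra_token_map(scores)  # assumed IntraM-style normalization
```

Under these assumptions, `inter` answers "which token dominates at this position?", while `intra` answers "where in the image does this token concentrate?". The second view is the one that lends itself to comparison against a bounding box.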