Rethinking The Training And Evaluation of Rich-Context Layout-to-Image Generation

Jiaxin Cheng, Zixu Zhao, Tong He, Tianjun Xiao, Yicong Zhou, Zheng Zhang

TL;DR

The paper tackles open-set, rich-context layout-to-image (L2I) generation, in which each object in the layout carries a complex, detailed text description. It introduces a regional cross-attention module that reorganizes the layout into regions and cross-attends the visual tokens of each region to the descriptions of the objects in that region. Each description is encoded together with its bounding box as an indicator (Sequenced Grounding Encoding). The design targets locality, completeness, and collectiveness of the layout conditioning. To evaluate open-set L2I, the authors propose two metrics, CropCLIP for object-label alignment and SAMIoU for layout fidelity, and run a user study to check that both agree with human judgments. Experiments with SDXL and SD1.5 backbones on data with synthetic rich-context descriptions (Rich-Context CC3M and RC COCO) show improved generation quality and reduced computation in the layout-conditioning layers, especially for complex, descriptive prompts.
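
To make the layout-fidelity idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes (this is not stated in the summary above) that a segmentation model such as SAM has already produced a boolean mask for one generated object, and that fidelity is scored as the IoU between the tightest box around that mask and the layout box. The function names `mask_to_box`, `box_iou`, and `layout_fidelity` are hypothetical, and the exact SAMIoU definition may differ.

```python
import numpy as np

def mask_to_box(mask):
    """Tightest (x0, y0, x1, y1) box around the True pixels of a 2D boolean mask."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # empty mask: nothing was segmented
    # Max edges are exclusive, hence the +1.
    return (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)

def box_iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes with exclusive max edges."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def layout_fidelity(mask, layout_box):
    """Assumed scoring rule: IoU between the mask's bounding box and the layout box."""
    box = mask_to_box(mask)
    return 0.0 if box is None else box_iou(box, layout_box)

# Toy example: the object fills a 4x4 patch, but the layout asked for a 6x4 box.
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True
score = layout_fidelity(mask, (2, 2, 8, 6))
print(round(score, 3))
```

In practice the mask would come from running a segmenter on the generated image with the layout box as a prompt, and the per-object scores would be averaged over a dataset. Both steps are left out here because the summary does not specify them.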

Abstract

Recent advancements in generative models have significantly enhanced their capacity for image generation, enabling a wide range of applications such as image editing, completion and video editing. A specialized area within generative modeling is layout-to-image (L2I) generation, where predefined layouts of objects guide the generative process. In this study, we introduce a novel regional cross-attention module tailored to enrich layout-to-image generation. This module notably improves the representation of layout regions, particularly in scenarios where existing methods struggle with highly complex and detailed textual descriptions. Moreover, while current open-vocabulary L2I methods are trained in an open-set setting, their evaluations often occur in closed-set environments. To bridge this gap, we propose two metrics to assess L2I performance in open-vocabulary scenarios. Additionally, we conduct a comprehensive user study to validate the consistency of these metrics with human preferences.

Paper Structure (23 sections, 3 equations, 10 figures, 4 tables, 2 algorithms)

This paper contains 23 sections, 3 equations, 10 figures, 4 tables, 2 algorithms.

Figures (10)

  • Figure 1: The proposed method accurately generates objects with complex descriptions in the correct locations while faithfully preserving the details specified in the text. In contrast, existing methods such as BoxDiff [boxdiff], R&B [randb], GLIGEN [li2023gligen], and InstDiff [instdiff] struggle with the complex object descriptions, leading to errors in the generated objects.
  • Figure 2: An example of regional cross-attention with two overlapping objects. Cross-attention is applied to each pair of regional visual and grounded textual tokens. The overlapping region cross-attends with the textual tokens containing both objects, while the non-object region attends to a learnable "$null$" token.
  • Figure 3: Sequenced Grounding Encoding with box coordinates as indicators.
  • Figure 4: Statistical comparisons between the synthetic object descriptions generated by GLIGEN [li2023gligen], InstDiff [instdiff], and our method. We measure 1) the average caption length, 2) the Gunning Fog Score, which estimates text complexity as the education level required to understand the text, 3) the number of unique words per sample, which indicates text diversity, and 4) the object-label CLIP Alignment Score, which measures object-label alignment. The results show that the pseudo-labels generated for our dataset are longer, more complex, more diverse, and better aligned with the objects than those generated by GLIGEN and InstDiff.
  • Figure 5: Qualitative comparison of rich-context L2I generation, showcasing our method alongside the open-set L2I approaches GLIGEN [li2023gligen] and InstDiff [instdiff], based on detailed object descriptions. Our method consistently generates more accurate representations of objects, particularly for specific attributes such as colors and shapes. Strikethrough text marks content from the descriptions that is missing in the generated objects. More qualitative results are available in the appendix.
  • ...and 5 more figures
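
The region reorganization described in the Figure 2 caption can be illustrated with a small Python sketch. It is a simplified reading of the caption, not the authors' code. It assumes boxes are given in token-grid coordinates, that every distinct set of covering boxes forms one region, and that the empty set stands for the non-object region that attends to the learnable null token. The function name `partition_regions` and the bitmask encoding are choices made for this example.

```python
import numpy as np

def partition_regions(boxes, height, width):
    """Group grid cells by the exact set of boxes that cover them.

    boxes: list of (x0, y0, x1, y1) in grid cells, max edges exclusive.
    Returns a dict mapping frozenset(box indices) -> boolean (height, width) mask.
    The empty frozenset is the non-object region.
    """
    n = len(boxes)
    cover = np.zeros((n, height, width), dtype=bool)
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        cover[i, y0:y1, x0:x1] = True
    # Encode each cell's covering set as an integer bitmask (bit i = box i).
    weights = (1 << np.arange(n)).reshape(n, 1, 1)
    codes = (cover * weights).sum(axis=0)
    regions = {}
    for code in np.unique(codes):
        members = frozenset(i for i in range(n) if (int(code) >> i) & 1)
        regions[members] = codes == code
    return regions

# Two overlapping objects on a 6x6 token grid.
boxes = [(0, 0, 4, 4), (2, 2, 6, 6)]
regions = partition_regions(boxes, height=6, width=6)

for members, mask in regions.items():
    # Text context for this region: all covering objects, or the null token.
    context = [f"object_{i}" for i in sorted(members)] or ["null"]
    print(context, int(mask.sum()), "cells")
```

Each resulting mask selects the visual tokens that would share one cross-attention call, with the textual context built from the descriptions of the objects in `members`. How those descriptions are combined and encoded is not specified in the captions, so it is left out.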