ToLo: A Two-Stage, Training-Free Layout-To-Image Generation Framework For High-Overlap Layouts

Linhao Huang, Jing Yu

TL;DR

This paper tackles accurate layout-to-image generation when the input layout contains heavily overlapping bounding boxes. It introduces ToLo, a two-stage, training-free framework that first aggregates each concept's attention map within its target region and then separates the attention maps of different concepts to reduce cross-concept leakage, guided by the losses $L_{\rm agg}$ and $L_{\rm sep}$ respectively. By applying ToLo to the R&B baseline and evaluating on an IoU-partitioned version of HRS-Bench, the authors demonstrate robust improvements for high-overlap layouts, while noting some trade-offs in object size control that can be mitigated by an IoU-based mode switch. The work contributes a practical, inference-time method that enhances spatial fidelity in diffusion-based layout-to-image synthesis and provides a new dataset partitioning strategy to benchmark overlap handling.

Abstract

Recent training-free layout-to-image diffusion models have demonstrated remarkable performance in generating high-quality images with controllable layouts. These models follow a one-stage framework: they define attention-map-based losses that encourage the attention map of each concept to focus on its corresponding region. However, these models still struggle to accurately follow layouts with significant overlap, often leading to issues such as attribute leakage and missing entities. In this paper, we propose ToLo, a two-stage, training-free layout-to-image generation framework for high-overlap layouts. Our framework consists of two stages, the aggregation stage and the separation stage, each with its own loss function based on the attention map. To provide a more effective evaluation, we partition the HRS dataset based on the Intersection over Union (IoU) of the input layouts, creating a new dataset for layout-to-image generation with varying levels of overlap. Through extensive experiments on this dataset, we demonstrate that ToLo significantly enhances the performance of existing methods when dealing with high-overlap layouts. Our code and dataset are available here: https://github.com/misaka12435/ToLo.
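
To make the IoU-based partitioning concrete, the following is a minimal sketch of how layouts could be bucketed by overlap. It is not the authors' code: the box format `(x0, y0, x1, y1)`, the use of the largest pairwise IoU as the layout-level score, and the bucket edges `0.1` and `0.3` are all assumptions chosen for illustration.

```python
from itertools import combinations


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def layout_overlap(boxes):
    """Layout-level score: largest pairwise IoU (an assumed aggregation rule)."""
    return max((box_iou(a, b) for a, b in combinations(boxes, 2)), default=0.0)


def partition_by_iou(layouts, edges=(0.1, 0.3)):
    """Group layout names into buckets; bucket i holds scores that pass i edges.

    `edges` are hypothetical thresholds, not values taken from the paper.
    """
    buckets = {i: [] for i in range(len(edges) + 1)}
    for name, boxes in layouts.items():
        score = layout_overlap(boxes)
        buckets[sum(score >= e for e in edges)].append(name)
    return buckets


layouts = {
    "disjoint": [(0, 0, 1, 1), (2, 2, 3, 3)],
    "partial": [(0, 0, 2, 2), (1, 1, 3, 3)],
    "nested": [(0, 0, 2, 2), (0, 0, 2, 1)],
}
print(partition_by_iou(layouts))
```

With these hypothetical edges, bucket 0 collects low-overlap layouts and the last bucket collects the high-overlap layouts that the abstract identifies as the hard case.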

Paper Structure

This paper contains 17 sections, 14 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Existing training-free layout-to-image synthesis methods struggle when the input layout contains large overlaps. Because they do not separate the attention maps of different objects, the attention maps often overlap, which can cause attribute leakage and missing entities. ToLo alleviates both problems while maintaining precise spatial control.
  • Figure 2: Attention maps of "red apple" and "yellow clock". In the original R&B, the attention regions for the entities "red apple" and "yellow clock" overlap, and the attention map for the "yellow clock" is indistinguishable from that of the "red apple", so the entity is missed (i.e., the "yellow clock" is omitted). After applying ToLo, the overlap is significantly reduced, so the prompt "A red apple and a yellow clock" is generated correctly. More examples can be found in the Appendix.
  • Figure 3: Overview of ToLo. In the aggregation stage, $L_{\rm agg}$ encourages the attention map of each concept to focus on its respective bounding box, achieving precise control over the spatial position of each concept. In the separation stage, $L_{\rm sep}$ enforces separation between the attention maps of different concepts, thereby alleviating the overlap problem of attention maps.
  • Figure 4: Qualitative results. All methods take the same grounded texts as inputs. The results show that our proposed ToLo can effectively alleviate problems such as attribute leakage and missing entities.
  • Figure 5: More examples of overlapping attention maps.
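
The roles that the Figure 3 caption assigns to $L_{\rm agg}$ and $L_{\rm sep}$ can be illustrated with a toy sketch. The formulas below are assumptions made for illustration, not the paper's definitions: the aggregation term is taken as the fraction of a concept's attention mass that falls outside its box, and the separation term as the mass shared by two normalized attention maps. Attention maps are plain nested lists here; in a real pipeline they would be differentiable tensors whose loss gradients steer the latent during sampling.

```python
def normalize(attn):
    """Scale a 2D attention map (list of rows) so its entries sum to 1."""
    total = sum(sum(row) for row in attn)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    return [[v / total for v in row] for row in attn]


def aggregation_loss(attn, box):
    """Assumed form: 1 minus the attention mass inside the box.

    `box` is (x0, y0, x1, y1) in cell indices with exclusive upper bounds.
    The loss is 0 when all attention lies inside the box.
    """
    x0, y0, x1, y1 = box
    p = normalize(attn)
    inside = sum(p[y][x] for y in range(y0, y1) for x in range(x0, x1))
    return 1.0 - inside


def separation_loss(attn_a, attn_b):
    """Assumed form: mass shared by two maps (sum of element-wise minima).

    The loss is 0 for disjoint maps and 1 for identical maps.
    """
    pa, pb = normalize(attn_a), normalize(attn_b)
    return sum(min(u, v) for ra, rb in zip(pa, pb) for u, v in zip(ra, rb))


apple = [[1.0, 1.0], [0.0, 0.0]]  # attention spread over the top row
clock = [[0.0, 1.0], [0.0, 1.0]]  # attention spread over the right column
print(aggregation_loss(apple, (0, 0, 1, 1)))  # half the mass lies outside the box
print(separation_loss(apple, clock))          # the maps share the top-right cell
```

Under these assumed definitions, minimizing the first term pulls each concept's attention into its own box, and minimizing the second pushes the two maps apart in the region where the boxes overlap, which mirrors the two stages described in the caption.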