Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation

Jaemin Cho, Linjie Li, Zhengyuan Yang, Zhe Gan, Lijuan Wang, Mohit Bansal

TL;DR

This work studies how well layout-guided image generation handles out-of-distribution (OOD) layouts. It introduces LayoutBench, a CLEVR-based diagnostic benchmark that isolates four spatial-control skills (number, position, size, shape) and scores layout accuracy as the average precision (AP) of a DETR-based detector run on the generated images, revealing large ID–OOD gaps in existing methods. To narrow this gap, it proposes IterInpaint, an iterative inpainting baseline built on Stable Diffusion that generates foreground and background regions step by step; it generalizes better to OOD layouts than existing models while remaining competitive on ID layouts. Ablations cover training task ratio, crop&paste vs. repaint, and generation order. In zero-shot evaluation on LayoutBench-COCO, a benchmark of OOD layouts with real objects, IterInpaint outperforms state-of-the-art baselines in all four splits. Together, the benchmark and the baseline give a way to diagnose and improve spatial control in diffusion-based image generation.

Abstract

Spatial control is a core capability in controllable image generation. Advancements in layout-guided image generation have shown promising results on in-distribution (ID) datasets with similar spatial configurations. However, it is unclear how these models perform when facing out-of-distribution (OOD) samples with arbitrary, unseen layouts. In this paper, we propose LayoutBench, a diagnostic benchmark for layout-guided image generation that examines four categories of spatial control skills: number, position, size, and shape. We benchmark two recent representative layout-guided image generation methods and observe that the good ID layout control may not generalize well to arbitrary layouts in the wild (e.g., objects at the boundary). Next, we propose IterInpaint, a new baseline that generates foreground and background regions step-by-step via inpainting, demonstrating stronger generalizability than existing models on OOD layouts in LayoutBench. We perform quantitative and qualitative evaluation and fine-grained analysis on the four LayoutBench skills to pinpoint the weaknesses of existing models. We show comprehensive ablation studies on IterInpaint, including training task ratio, crop&paste vs. repaint, and generation order. Lastly, we evaluate the zero-shot performance of different pretrained layout-guided image generation models on LayoutBench-COCO, our new benchmark for OOD layouts with real objects, where our IterInpaint consistently outperforms SOTA baselines in all four splits. Project website: https://layoutbench.github.io
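The step-by-step generation that the abstract attributes to IterInpaint can be illustrated with a toy loop. This is a minimal sketch, not the authors' implementation: the "image" is a grid of labels, `toy_inpaint` is a hypothetical stand-in for a diffusion inpainting model, and the foreground-then-background order and the `"background"` prompt are assumptions made for illustration. Only the loop structure (prompt, mask, and previous image in; new image out) follows the description in the paper.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]      # (x0, y0, x1, y1); x1 and y1 are exclusive
Image = List[List[Optional[str]]]    # toy image: one label per cell
Mask = List[List[bool]]              # True where the model may paint


def box_mask(box: Box, height: int, width: int) -> Mask:
    """Binary mask that is True inside the box."""
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 for x in range(width)]
            for y in range(height)]


def toy_inpaint(prompt: str, mask: Mask, image: Image) -> Image:
    """Hypothetical stand-in for an inpainting model: fills masked cells
    with the prompt text and leaves all other cells untouched."""
    return [[prompt if m else px for px, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


def iterative_inpaint(
    objects: Sequence[Tuple[str, Box]],
    height: int,
    width: int,
    inpaint: Callable[[str, Mask, Image], Image] = toy_inpaint,
    background_prompt: str = "background",  # assumed prompt, for illustration
) -> Image:
    """Generate one region per iteration, then fill what is left."""
    image: Image = [[None] * width for _ in range(height)]   # blank canvas
    covered: Mask = [[False] * width for _ in range(height)]
    for prompt, box in objects:
        mask = box_mask(box, height, width)
        # Each step sees the prompt, the mask, and the previous image.
        image = inpaint(prompt, mask, image)
        covered = [[c or m for c, m in zip(c_row, m_row)]
                   for c_row, m_row in zip(covered, mask)]
    # Final step: paint everything not covered by any object box.
    background = [[not c for c in row] for row in covered]
    return inpaint(background_prompt, background, image)


layout = [("cube", (0, 0, 2, 2)), ("sphere", (2, 1, 4, 3))]
result = iterative_inpaint(layout, height=3, width=4)
for row in result:
    print(row)
```

Because every call only changes the masked cells, an object placed in an earlier iteration cannot be moved by a later one. Swapping `toy_inpaint` for a real inpainting model would keep the same loop.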

Paper Structure (50 sections, 6 figures, 24 tables)

This paper contains 50 sections, 6 figures, 24 tables.

Figures (6)

  • Figure 1: We propose LayoutBench, a diagnostic benchmark for layout-guided image generation models with out-of-distribution (OOD) layouts in four skills: number, position, size, and shape. Existing models such as ReCo (Yang et al., 2022) fail on OOD layouts by misplacing objects. Next, we introduce IterInpaint, a new baseline model with better generalization on OOD layouts.
  • Figure 2: In LayoutBench, we measure 4 spatial control skills (number, position, size, shape) for layout-guided image generation. First, 1) we query the image generation models with OOD layouts. Then, 2) we detect the objects in the generated images and calculate the layout accuracy in average precision (AP). In each image, the ground-truth boxes are shown in blue and the detected objects are shown in red. The images are generated by ReCo (Yang et al., 2022) trained on CLEVR (Johnson et al., 2017); on OOD layouts from LayoutBench it often misplaces objects (e.g., many red boxes fall outside the blue boxes) or misses them (e.g., many blue boxes are missed).
  • Figure 3: IterInpaint Training. Our model is trained with (1) foreground and (2) background inpainting tasks.
  • Figure 4: IterInpaint Inference. Illustration (left) and Python pseudocode (right) of layout-guided image generation with IterInpaint. At each iteration, the inpainting model takes the prompt, mask, and previous image as inputs and generates a new image.
  • Figure 5: Detailed layout accuracy analysis with fine-grained splits of 4 LayoutBench skills. In-distribution splits (same attributes as CLEVR) are colored in gray. For the Shape skill, the splits are named after their height/width ratio (e.g., the H2W1 split consists of objects with a 2:1 height:width ratio).
  • ...and 1 more figure
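The evaluation step in the Figure 2 caption compares detected boxes with the ground-truth layout. The sketch below shows the box-overlap test that such a comparison rests on. It is a simplification under stated assumptions: it computes intersection-over-union (IoU) and a greedy one-to-one matching at a single threshold, then reports precision and recall. It is not the full AP metric reported in the paper, and the function names and the 0.5 threshold are chosen here for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_boxes(
    ground_truth: List[Box],
    detected: List[Box],
    threshold: float = 0.5,  # assumed IoU threshold, for illustration
) -> Tuple[float, float]:
    """Greedily match each detection to its best unmatched ground-truth box.

    Returns (precision, recall) at the given IoU threshold.
    """
    unmatched = list(range(len(ground_truth)))
    true_positives = 0
    for det in detected:
        best_index, best_iou = None, threshold
        for i in unmatched:
            overlap = iou(ground_truth[i], det)
            if overlap >= best_iou:
                best_index, best_iou = i, overlap
        if best_index is not None:
            unmatched.remove(best_index)   # each ground-truth box matches once
            true_positives += 1
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


gt = [(0, 0, 10, 10), (20, 20, 30, 30)]      # requested layout (blue boxes)
det = [(1, 1, 10, 10), (50, 50, 60, 60)]     # detector output (red boxes)
print(match_boxes(gt, det))
```

In this example the second detection lies far from any requested box (a misplaced object), and the second ground-truth box has no detection (a missed object), which are the two failure modes the caption describes.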