DivCon: Divide and Conquer for Complex Numerical and Spatial Reasoning in Text-to-Image Generation

Yuhao Jia, Wenhan Tan

TL;DR

DivCon introduces a training-free divide-and-conquer framework for layout-based text-to-image generation. By splitting layout prediction into numerical/spatial reasoning and constrained bounding-box planning, it lets a lightweight LLM match large-scale models in layout accuracy. Layout-to-image generation then proceeds progressively in two rounds: the first produces the full scene, and the second refines hard objects under consistency guidance. Experiments on HRS and NSR-1K show state-of-the-art results in both layout fidelity and image quality, with notable improvements on multi-object and complex-spatial prompts and reduced reliance on large LLMs.

Abstract

Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements in recent years. To further improve T2I models' capability in numerical and spatial reasoning, layout is employed as an intermediate representation to bridge large language models and layout-based diffusion models. However, these methods often rely on closed-source, large-scale LLMs for layout prediction, limiting accessibility and scalability. They also struggle to generate images from prompts with multiple objects and complicated spatial relationships. To tackle these challenges, we introduce a divide-and-conquer approach that decouples the generation task into multiple subtasks. First, the layout prediction stage is divided into numerical & spatial reasoning and bounding box visual planning, enabling even lightweight LLMs to achieve layout accuracy comparable to large-scale models. Second, the layout-to-image generation stage is divided into two steps that synthesize objects progressively, from easy ones to difficult ones. Experiments are conducted on the HRS and NSR-1K benchmarks, and our method outperforms previous approaches by notable margins. In addition, visual results and a user study demonstrate that our approach significantly improves perceptual quality, especially when generating multiple objects from complex textual prompts.
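
The two divisions described in the abstract can be pictured with a small data-flow sketch. This is only an illustration under stated assumptions, not the authors' implementation: the `Box` type, the helper names `plan_matches_reasoning` and `split_easy_hard`, the per-object consistency scores, and the 0.5 threshold are all hypothetical, and the LLM and diffusion model calls are left out entirely. It shows one way the object counts from the reasoning step could constrain the planned boxes, and how objects could be partitioned into easy and hard groups before a second generation round.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical layout entry: a label plus a normalized (x, y, w, h) box."""
    label: str
    x: float
    y: float
    w: float
    h: float


def plan_matches_reasoning(counts: Dict[str, int], boxes: List[Box]) -> bool:
    """Return True if the planned boxes contain exactly the reasoned object counts."""
    planned: Dict[str, int] = {}
    for box in boxes:
        planned[box.label] = planned.get(box.label, 0) + 1
    return planned == counts


def split_easy_hard(
    boxes: List[Box],
    score: Callable[[Box], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Tuple[List[Box], List[Box]]:
    """Partition boxes by a consistency score: at or above threshold is 'easy'."""
    easy = [b for b in boxes if score(b) >= threshold]
    hard = [b for b in boxes if score(b) < threshold]
    return easy, hard


# Step 1a (assumed output of numerical & spatial reasoning): object counts.
counts = {"cat": 2, "dog": 1}

# Step 1b (assumed output of bounding-box planning): one box per object.
boxes = [
    Box("cat", 0.05, 0.5, 0.25, 0.3),
    Box("cat", 0.35, 0.5, 0.25, 0.3),
    Box("dog", 0.65, 0.5, 0.30, 0.3),
]
layout_ok = plan_matches_reasoning(counts, boxes)

# Step 2: after a first generation round, score each object (values made up)
# and keep only the low-scoring objects for the second round.
made_up_scores = {"cat": 0.9, "dog": 0.2}
easy, hard = split_easy_hard(boxes, lambda b: made_up_scores[b.label])
```

In this toy run the layout passes the count check, the two cat boxes fall in the easy group, and the dog box is the only one left for the second round. In a real system the score would come from comparing the first-round image against the layout, which this sketch does not attempt.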

Paper Structure (25 sections, 3 equations, 13 figures, 8 tables)

Figures (13)

  • Figure 1: Layouts and images generated by our DivCon. DivCon enhances the capability of text-to-image diffusion models to understand complex numerical and spatial relationships in the text.
  • Figure 2: A comparison of generation difficulty between "toaster" and "pizza".
  • Figure 3: Comparison of easy and hard samples in a generated image.
  • Figure 4: The proposed DivCon framework. The layout prediction stage (stage 1) is divided into numerical & spatial reasoning and bounding box planning. The layout-to-image generation stage (stage 2) is divided into two steps to generate objects with different difficulty levels separately.
  • Figure 5: Qualitative comparison of numerical & spatial reasoning between Stable Diffusion, Attend-and-Excite, Attention-Refocusing and our DivCon.
  • ...and 8 more figures