
LayerCraft: Enhancing Text-to-Image Generation with CoT Reasoning and Layered Object Integration

Yuyao Zhang, Jinghao Li, Yu-Wing Tai

TL;DR

LayerCraft presents a modular, three-agent system for structured text-to-image generation that leverages chain-of-thought reasoning to produce 3D-aware layouts and an image-guided inpainting network for seamless object integration. The ChainArchitect constructs background-first layouts and bounding boxes, while the Object Integration Network refines object insertions with dual-LoRA adapters and attention mixing, all orchestrated by a GPT-4o-based coordinator. Across extensive quantitative and human-evaluation benchmarks, LayerCraft achieves superior spatial coherence, object fidelity, and multi-turn editing stability, outperforming both generic diffusion models and prior agent-based approaches. The framework democratizes high-quality, controllable image synthesis and batch editing, with practical impact for creative and professional workflows, while acknowledging computational overhead and ethical considerations.

Abstract

Text-to-image (T2I) generation has made remarkable progress, yet existing systems still lack intuitive control over spatial composition, object consistency, and multi-step editing. We present LayerCraft, a modular framework that uses large language models (LLMs) as autonomous agents to orchestrate structured, layered image generation and editing. LayerCraft supports two key capabilities: (1) structured generation from simple prompts via chain-of-thought (CoT) reasoning, enabling it to decompose scenes, reason about object placement, and guide composition in a controllable, interpretable manner; and (2) layered object integration, allowing users to insert and customize objects -- such as characters or props -- across diverse images or scenes while preserving identity, context, and style. The system comprises a coordinator agent, the ChainArchitect for CoT-driven layout planning, and the Object Integration Network (OIN) for seamless image editing using off-the-shelf T2I models without retraining. Through applications like batch collage editing and narrative scene generation, LayerCraft empowers non-experts to iteratively design, customize, and refine visual content with minimal manual effort. Code will be released at https://github.com/PeterYYZhang/LayerCraft.
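
The division of labor described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (LayoutItem, Canvas, plan_layout, integrate, coordinate) is hypothetical, the layout is hard-coded rather than produced by an LLM, the integration step only records insertions rather than inpainting pixels, and the far-to-near insertion order is an assumption about how a background-first layout might be consumed.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (x0, y0, x1, y1), normalized to [0, 1]; the coordinate convention is an assumption.
BBox = Tuple[float, float, float, float]


@dataclass
class LayoutItem:
    name: str
    bbox: BBox
    depth: int  # hypothetical ordering key: larger means farther from the camera


@dataclass
class Canvas:
    background: str
    layers: List[str] = field(default_factory=list)


def plan_layout(prompt: str) -> Tuple[str, List[LayoutItem]]:
    """Stand-in for ChainArchitect: returns a fixed plan instead of calling an LLM."""
    background = f"background for: {prompt}"
    items = [
        LayoutItem("rabbit", (0.55, 0.60, 0.75, 0.90), depth=1),
        LayoutItem("tea table", (0.20, 0.50, 0.60, 0.85), depth=2),
    ]
    return background, items


def integrate(canvas: Canvas, item: LayoutItem) -> Canvas:
    """Stand-in for the OIN: validates the box and records the insertion."""
    x0, y0, x1, y1 = item.bbox
    if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
        raise ValueError(f"invalid bounding box for {item.name}: {item.bbox}")
    canvas.layers.append(item.name)
    return canvas


def coordinate(prompt: str) -> Canvas:
    """Stand-in for the coordinator: plan first, then insert objects one by one."""
    background, items = plan_layout(prompt)
    canvas = Canvas(background=background)
    # Assumed ordering: farthest objects first, so nearer ones are layered on top.
    for item in sorted(items, key=lambda it: it.depth, reverse=True):
        canvas = integrate(canvas, item)
    return canvas
```

Calling coordinate("Alice in Wonderland") in this sketch yields a canvas whose layers list reflects the order in which objects were inserted. In a real system, each stub would be replaced by a model call.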

Paper Structure

This paper contains 25 sections, 3 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Application demonstrations for LayerCraft. Left: Demonstrates batch collage editing capabilities. A user uploads graduation photos and LayerCraft seamlessly integrates a graduation bear across all images. The system first generates a reference bear for consistency, then analyzes optimal placement while preserving facial identity and background integrity. Right: Illustrates the structured text-to-image generation process. From a simple "Alice in Wonderland" prompt, LayerCraft employs chain-of-thought reasoning to sequentially generate background elements, determine object layout, and compose the final image. The framework supports post-generation customization, as shown with the lion integration.
  • Figure 2: LayerCraft is a framework with three key components: the LayerCraft Coordinator, which processes user instructions and manages collaboration; ChainArchitect, which enhances prompts to plan layouts, identify objects and relationships, and assign bounding boxes using Chain-of-Thought reasoning; and the Object Integration Network (OIN), which enables image-guided inpainting for seamless object integration using the LoRA fine-tuned FLUX model.
  • Figure 3: Architecture of the Object Integration Network (OIN). The system processes a text prompt, a background image with a designated bounding box, and a reference object to produce a seamlessly integrated result. Red, yellow, and blue indicators mark the use of combined LoRA weights, background inpainting weights, and subject-driven generation weights, respectively. "FF" and "MM Attn" denote the feedforward layers and multi-modal attention layers in the FLUX model.
  • Figure 4: Visual comparisons with state-of-the-art generic text-to-image generation models are presented. On the left, the prompts are annotated with distinct colors to highlight critical attributes and relationships.
  • Figure 5: Further example uses of LayerCraft. Compared with GPT-4o, our model generates results with a consistent background and object identity. The examples also illustrate the importance of the pipeline's design, with the OIN and intermediate reference images. GenArtist still fails even when given the ground-truth bounding boxes and extra instructions.
  • ...and 10 more figures