Efficient Multi-Instance Generation with Janus-Pro-Driven Prompt Parsing

Fan Qi, Yu Duan, Changsheng Xu

TL;DR

This work tackles the difficulty of open-world multi-object image synthesis with diffusion models by introducing a two-stage framework that separates layout planning from image generation. Janus-Pro-driven Prompt Parsing converts prompts into structured layouts, while MIGLoRA injects spatial priors into diffusion backbones via a parameter-efficient LoRA plug-in, enabling scalable multi-instance generation. The authors validate their approach on DescripBox and standard benchmarks (COCO, LVIS), achieving state-of-the-art results with minimal additional parameters and strong layout fidelity. Together, these components offer a practical, scalable path for accurate, high-resolution, multi-object synthesis in open-world settings.

Abstract

Recent advances in text-guided diffusion models have revolutionized conditional image generation, yet these models struggle to synthesize complex scenes with multiple objects due to imprecise spatial grounding and limited scalability. We address these challenges through two key modules: 1) Janus-Pro-driven Prompt Parsing, a prompt-layout parsing module that bridges text understanding and layout generation via a compact 1B-parameter architecture, and 2) MIGLoRA, a parameter-efficient plug-in integrating Low-Rank Adaptation (LoRA) into UNet (SD1.5) and DiT (SD3) backbones. MIGLoRA preserves the base model's parameters and ensures plug-and-play adaptability, minimizing architectural intrusion while enabling efficient fine-tuning. To support a comprehensive evaluation, we create DescripBox and DescripBox-1024, benchmarks that span diverse scenes and resolutions. The proposed method achieves state-of-the-art performance on COCO and LVIS benchmarks while maintaining parameter efficiency, demonstrating superior layout fidelity and scalability for open-world synthesis.
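
To make the parameter-efficiency claim concrete, the following is a minimal sketch of the generic LoRA idea on a single linear layer, written in NumPy. It is not the authors' MIGLoRA implementation: the class name `LoRALinear`, the rank, the scaling factor, and the 320-dimensional layer are all illustrative assumptions. The frozen weight stays untouched and only two small low-rank matrices would be trained.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration of standard LoRA, not the paper's code.
    """

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen base weight
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))    # trainable down-projection
        self.B = np.zeros((out_dim, rank))                     # trainable up-projection, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T                  # output of the unmodified layer
        update = (x @ self.A.T) @ self.B.T        # low-rank correction
        return base + self.scale * update

    def trainable_params(self):
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
W = rng.normal(size=(320, 320))       # assumed layer size, for illustration only
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(2, 320))

# With B initialised to zero, the adapted layer reproduces the base layer exactly.
unchanged_at_init = np.allclose(layer(x), x @ W.T)
```

In this toy setting the adapter adds 2,560 trainable values next to a frozen weight of 102,400 values. Removing the adapter restores the base layer, which is what makes such plug-ins easy to attach and detach.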

Paper Structure

This paper contains 21 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Stage $I$: In Layout Understanding, the model extracts spatial information and infers object bounding boxes from input images and textual subprompts. Draft Generation employs an inverse training objective to synthesize structured layouts and object descriptors, ensuring alignment between textual prompts and spatial configurations.
  • Figure 2: Stage $II$: (a) UNet-based architecture: The bounding box encoder generates mask latents, which are concatenated with VAE-encoded image latents to form layout latents. Each layout latent is processed separately through the UNet encoder, requiring multiple passes for multiple bounding boxes. (b) DiT-based architecture: All layout latents are simultaneously fed into the DiT architecture, improving model efficiency.
  • Figure 3: Qualitative comparison with SOTA methods on COCO val at 512×512. Compared to baseline methods (CAG [14], MtDM [11], MIGC [33], InstanceDiff [36], GLIGEN [34], and HiCo [37]), MIGLoRA(SD1.5) demonstrates superior performance in composing multiple independent concepts ($\geq 4$ objects) while maintaining better spatial relationships and visual quality.
  • Figure 4: Qualitative comparison with the SOTA method on DescripBox-Val. Compared to CreatiLayout (Zhang et al., 2024), our model fine-tunes Stable Diffusion 3 to generate high-quality 1024 $\times$ 1024 images for layout-based image generation.
  • Figure 5: The experimental results of MIGLoRAJP(SDXL) show that our model can generate satisfactory images in complex scenarios.
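
The data flow described in the Figure 2 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's architecture: a plain box rasterization stands in for the learned bounding box encoder, `toy_backbone` is a placeholder for the UNet or DiT, and the helper names and the 4×64×64 latent shape are hypothetical. The sketch only shows how per-box layout latents are formed by channel concatenation, and the difference between one pass per box and a single batched pass.

```python
import numpy as np

def box_to_mask(box, size):
    """Rasterise a normalised (x0, y0, x1, y1) box into a (1, size, size) binary mask."""
    x0, y0, x1, y1 = box
    mask = np.zeros((1, size, size), dtype=np.float32)
    c0, c1 = int(round(x0 * size)), int(round(x1 * size))
    r0, r1 = int(round(y0 * size)), int(round(y1 * size))
    mask[:, r0:r1, c0:c1] = 1.0
    return mask

def layout_latents(image_latent, boxes):
    """Concatenate one mask per box with the (square) image latent along channels."""
    size = image_latent.shape[-1]
    return np.stack(
        [np.concatenate([image_latent, box_to_mask(b, size)], axis=0) for b in boxes]
    )

def toy_backbone(x):
    """Placeholder denoiser: maps (N, C, H, W) to per-item channel means (N, C)."""
    return x.mean(axis=(2, 3))


image_latent = np.ones((4, 64, 64), dtype=np.float32)     # assumed VAE latent shape
boxes = [(0.0, 0.0, 0.5, 0.5), (0.25, 0.25, 1.0, 1.0)]
latents = layout_latents(image_latent, boxes)              # shape (2, 5, 64, 64)

# (a) UNet-style: one backbone pass per layout latent.
per_box = np.concatenate([toy_backbone(l[None]) for l in latents])
# (b) DiT-style: all layout latents processed in a single pass.
batched = toy_backbone(latents)
```

Both routes give the same result for this placeholder backbone. The practical difference is the number of backbone calls, which grows with the number of boxes in (a) and stays at one in (b).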