Efficient Multi-Instance Generation with Janus-Pro-Driven Prompt Parsing
Fan Qi, Yu Duan, Changsheng Xu
TL;DR
This work tackles the difficulty of open-world multi-object image synthesis with diffusion models by introducing a two-stage framework that separates layout planning from image generation. Janus-Pro-driven Prompt Parsing converts prompts into structured layouts, while MIGLoRA injects spatial priors into diffusion backbones via a parameter-efficient LoRA plug-in, enabling scalable multi-instance generation. The authors validate their approach on DescripBox and standard benchmarks (COCO, LVIS), achieving state-of-the-art results with minimal additional parameters and strong layout fidelity. Together, these components offer a practical, scalable path for accurate, high-resolution, multi-object synthesis in open-world settings.
Abstract
Recent advances in text-guided diffusion models have revolutionized conditional image generation, yet these models struggle to synthesize complex scenes with multiple objects because of imprecise spatial grounding and limited scalability. We address these challenges through two key modules: 1) Janus-Pro-driven Prompt Parsing, a prompt-to-layout parsing module that bridges text understanding and layout generation via a compact 1B-parameter architecture, and 2) MIGLoRA, a parameter-efficient plug-in that integrates Low-Rank Adaptation (LoRA) into UNet (SD1.5) and DiT (SD3) backbones. MIGLoRA preserves the base model's parameters and remains plug-and-play, minimizing architectural intrusion while enabling efficient fine-tuning. To support a comprehensive evaluation, we create DescripBox and DescripBox-1024, benchmarks that span diverse scenes and resolutions. The proposed method achieves state-of-the-art performance on the COCO and LVIS benchmarks while maintaining parameter efficiency, demonstrating superior layout fidelity and scalability for open-world synthesis.
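
For readers unfamiliar with LoRA, the following is a minimal sketch of the generic low-rank update that a plug-in of this kind builds on. It is not the MIGLoRA implementation: the abstract does not specify how spatial priors are injected, and the class name `LoRALinear`, the helper `matvec`, and the toy dimensions are all assumptions made for illustration. The sketch shows only why the base weights stay untouched and why a freshly attached adapter leaves the base model's output unchanged.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Hypothetical illustration: frozen weight W0 plus a low-rank update.

    Output is W0 @ x + (alpha / rank) * B @ (A @ x). Only A and B would be trained.
    """

    def __init__(self, W0, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W0), len(W0[0])
        self.W0 = W0  # base weight: read, never modified
        # A gets a small random init; B starts at zero so the update is zero at first.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W0, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def num_trainable(self):
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)


# Toy 3x4 base weight (illustrative values only).
W0 = [[1.0, 2.0, 0.0, 0.0],
      [0.0, 1.0, 0.0, 3.0],
      [0.5, 0.0, 1.0, 0.0]]
x = [1.0, 2.0, 3.0, 4.0]
layer = LoRALinear(W0, rank=1)

print(layer.forward(x) == matvec(W0, x))  # adapter is a no-op right after attachment
print(layer.num_trainable(), "trainable values vs", len(W0) * len(W0[0]), "in W0")
```

In this toy setting a rank-1 adapter trains 7 values against 12 in the base matrix; the gap widens quickly at realistic layer sizes, which is the usual source of LoRA's parameter efficiency. Removing the adapter recovers the original layer exactly, since `W0` is never written to.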
