EliGen: Entity-Level Controlled Image Generation with Regional Attention
Hong Zhang, Zhongjie Duan, Xingjun Wang, Yingda Chen, Yu Zhang
TL;DR
EliGen tackles the challenge of fine-grained, entity-level control in text-to-image generation by introducing regional attention, which extends diffusion transformers to handle arbitrary-shaped spatial masks without adding new parameters. A large annotated dataset enables supervised fine-tuning via LoRA, yielding robust and precise control over multiple entities, including their layouts and details. A novel inpainting fusion pipeline extends this control to multi-entity edits. The model demonstrates strong performance on COCO in entity fidelity, spatial accuracy, and image quality, and is preferred by human evaluators over prior methods. Additionally, EliGen integrates with open-source tools such as IP-Adapter, In-Context LoRA, and MLLMs to enable styled entity control, entity transfer, and dialogue-driven design, highlighting its practical potential for advanced image synthesis and editing.
Abstract
Recent advances in diffusion models have significantly improved text-to-image generation, yet global text prompts alone remain insufficient for fine-grained control over individual entities within an image. To address this limitation, we present EliGen, a novel framework for Entity-level controlled image Generation. First, we introduce regional attention, a mechanism for diffusion transformers that requires no additional parameters and seamlessly integrates entity prompts with arbitrary-shaped spatial masks. By contributing a high-quality dataset with fine-grained spatial and semantic entity-level annotations, we train EliGen to achieve robust and accurate entity-level manipulation, surpassing existing methods in both spatial precision and image quality. Additionally, we propose an inpainting fusion pipeline that extends EliGen to multi-entity image inpainting tasks. We further demonstrate its flexibility by integrating it with other open-source models such as IP-Adapter, In-Context LoRA, and MLLMs, unlocking new creative possibilities. The source code, model, and dataset are published at https://github.com/modelscope/DiffSynth-Studio.git.
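
To make the idea of parameter-free regional attention concrete, the following is a minimal NumPy sketch of one way an attention mask could tie entity prompts to arbitrary-shaped spatial masks. It is an illustration under stated assumptions, not the released implementation: the token layout (entity prompt tokens, then global prompt tokens, then image tokens), the choice to let image tokens attend to each other freely, and the function names `build_regional_attention_mask` and `masked_attention` are all hypothetical.

```python
import numpy as np


def build_regional_attention_mask(entity_token_counts, global_token_count, entity_masks):
    """Boolean matrix where True means the query (row) may attend to the key (column).

    Assumed layout: [entity_0 tokens, ..., entity_k tokens, global tokens, image tokens].
    entity_masks: boolean arrays of shape (H, W) at image-token resolution.
    """
    h, w = entity_masks[0].shape
    num_text = sum(entity_token_counts) + global_token_count
    total = num_text + h * w
    allowed = np.zeros((total, total), dtype=bool)

    # Assumption: image tokens attend to every image token.
    allowed[num_text:, num_text:] = True

    # Assumption: the global prompt interacts with itself and the whole image.
    g0 = sum(entity_token_counts)
    g1 = g0 + global_token_count
    allowed[g0:g1, g0:g1] = True
    allowed[g0:g1, num_text:] = True
    allowed[num_text:, g0:g1] = True

    # Each entity prompt interacts only with itself and the image tokens in its region.
    start = 0
    for count, mask in zip(entity_token_counts, entity_masks):
        end = start + count
        prompt_idx = np.arange(start, end)
        region_idx = np.flatnonzero(mask.reshape(-1)) + num_text
        allowed[start:end, start:end] = True
        allowed[np.ix_(prompt_idx, region_idx)] = True
        allowed[np.ix_(region_idx, prompt_idx)] = True
        start = end
    return allowed


def masked_attention(q, k, v, allowed):
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


# Usage: two entities on a 2x2 token grid (left column and right column).
left = np.array([[True, False], [True, False]])
right = ~left
mask = build_regional_attention_mask([2, 1], 1, [left, right])
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(mask.shape[0], 4)) for _ in range(3))
out = masked_attention(q, k, v, mask)
```

Because the mask only restricts which tokens may interact, the sketch adds no learnable weights; the existing attention projections are reused unchanged, which is consistent with the claim that regional attention requires no additional parameters.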
