EliGen: Entity-Level Controlled Image Generation with Regional Attention

Hong Zhang; Zhongjie Duan; Xingjun Wang; Yingda Chen; Yu Zhang

TL;DR

EliGen addresses fine-grained, entity-level control in text-to-image generation by introducing regional attention, which extends diffusion transformers to handle arbitrary-shaped spatial masks without adding new parameters. A large annotated dataset enables supervised fine-tuning via LoRA, yielding robust and precise control over the layout and details of multiple entities, and a novel inpainting fusion pipeline supports multi-entity edits. The model demonstrates strong performance on COCO in entity fidelity, spatial accuracy, and image quality, and is favored in human preference comparisons with prior methods. Additionally, EliGen integrates with open-source tools such as IP-Adapter, In-Context LoRA, and multimodal large language models (MLLMs) to enable styled entity control, entity transfer, and dialogue-driven design, highlighting its practical potential for advanced image synthesis and editing.

Abstract

Recent advances in diffusion models have significantly improved text-to-image generation, yet global text prompts alone remain insufficient for achieving fine-grained control over individual entities within an image. To address this limitation, we present EliGen, a novel framework for Entity-level controlled image Generation. First, we introduce regional attention, a mechanism for diffusion transformers that requires no additional parameters and seamlessly integrates entity prompts with arbitrary-shaped spatial masks. By contributing a high-quality dataset with fine-grained spatial and semantic entity-level annotations, we train EliGen to achieve robust and accurate entity-level manipulation, surpassing existing methods in both spatial precision and image quality. Additionally, we propose an inpainting fusion pipeline that extends EliGen to multi-entity image inpainting tasks. We further demonstrate its flexibility by integrating it with other open-source models such as IP-Adapter, In-Context LoRA, and MLLMs, unlocking new creative possibilities. The source code, model, and dataset are published at https://github.com/modelscope/DiffSynth-Studio.git.

Paper Structure (45 sections, 11 equations, 18 figures, 3 tables)

Introduction
Related Work
Text-to-Image Diffusion Models.
Entity-Level Controlled Generation.
Approach
Preliminaries
Problem Definition
Motivation
Regional Attention
Dataset with Entity Annotation
Implementation Details
Qualitative Experiments
Entity-Level Controlled Generation
Rectangular Masks as Input
Arbitrary Masks as Input
...and 30 more sections

Figures (18)

Figure 1: Entity control ability of EliGen. The global prompt is "top-down view of a desk, laptop, a pot of rose, and a book." (a) No entity control. (b) Untrained regional attention modifies regional details (entity "rose") but lacks position control ability (entities "laptop" and "book"). (c) After training, EliGen successfully achieves control over all entities.
Figure 2: EliGen enables spatial and semantic control of each entity. (a) By incorporating local prompts and masks for each entity, it generates images with specific layouts and details. Unlike previous models restricted to rectangular controls, EliGen supports arbitrary-shaped masks, facilitating more creative generation. (b) Additionally, it performs image inpainting with input images. Notably, our model demonstrates robust generalization, consistently producing ideal layouts across different seeds, with detailed experiments in the Supplementary Material.
Figure 3: The regional attention mechanism within the double-stream transformer block of DiT. (a) The diffusion model. (b) The global and local prompts are encoded and concatenated with the latent embeddings $z$ to form the attention sequence. (c) The attention mask $\mathrm{M}$ is constructed from multiple components, each defining the specific region for which each sequence token should perform attention. In the composed mask $\mathrm{M}$, all colored regions indicate 1, and gray areas indicate 0.
Figure 4: Qualitative results conditioned on multiple rectangular-shaped entities. Test case combinations evolve from simple to complex, with the final two rows illustrating the enhanced image quality and coherence of our model in the presence of entity coupling.
Figure 5: Qualitative results with arbitrary-shaped entities.
...and 13 more figures
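
The composed attention mask described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: the function name, the token ordering (global prompt, then each entity's local prompt, then latent tokens), and the exact visibility rules are assumptions made here for illustration. The sketch assumes that the global prompt and the latents attend to each other everywhere, that each entity prompt attends only to itself and to the latent tokens inside its own mask, and that entity prompts do not attend to one another or to the global prompt.

```python
def build_regional_attention_mask(n_global, n_local, entity_masks, n_latent):
    """Compose a boolean attention mask M (True = may attend, i.e. the "1" regions).

    n_global:     number of global-prompt tokens
    n_local:      list of token counts, one per entity prompt
    entity_masks: list of flat 0/1 lists of length n_latent, one per entity
    n_latent:     number of latent (image) tokens

    Assumed sequence order: [global prompt | entity prompts... | latents].
    """
    n_text = n_global + sum(n_local)
    total = n_text + n_latent
    M = [[False] * total for _ in range(total)]

    global_idx = range(0, n_global)
    latent_idx = range(n_text, total)

    # Assumption: the global prompt attends to itself and, in both
    # directions, to every latent token.
    for i in global_idx:
        for j in global_idx:
            M[i][j] = True
        for j in latent_idx:
            M[i][j] = True
            M[j][i] = True

    # Assumption: latent tokens attend to all latent tokens.
    for i in latent_idx:
        for j in latent_idx:
            M[i][j] = True

    # Each entity prompt attends to itself and only to the latent
    # tokens covered by its own (arbitrary-shaped) spatial mask.
    start = n_global
    for count, mask in zip(n_local, entity_masks):
        entity_idx = range(start, start + count)
        for i in entity_idx:
            for j in entity_idx:
                M[i][j] = True
            for k, inside in enumerate(mask):
                if inside:
                    M[i][n_text + k] = True
                    M[n_text + k][i] = True
        start += count
    return M


# Toy example: 1 global token, two entities with 1 token each,
# and a flattened 2x2 latent grid (4 latent tokens).
M = build_regional_attention_mask(
    n_global=1,
    n_local=[1, 1],
    entity_masks=[[1, 0, 0, 0], [0, 0, 0, 1]],
    n_latent=4,
)
```

Because the mask only restricts which token pairs participate in attention, a construction of this kind adds no trainable parameters, which is consistent with the abstract's description of regional attention.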
