DEIG: Detail-Enhanced Instance Generation with Fine-Grained Semantic Control

Shiyan Du; Conghan Yue; Xinyu Cheng; Dongyu Zhang

TL;DR

DEIG is a framework for fine-grained, controllable multi-instance generation. It functions as a plug-and-play module that integrates easily into standard diffusion-based pipelines.

Abstract

Multi-Instance Generation has advanced significantly in spatial placement and attribute binding. However, existing approaches still face challenges in fine-grained semantic understanding, particularly when dealing with complex textual descriptions. To overcome these limitations, we propose DEIG, a novel framework for fine-grained and controllable multi-instance generation. DEIG integrates an Instance Detail Extractor (IDE) that transforms text encoder embeddings into compact, instance-aware representations, and a Detail Fusion Module (DFM) that applies instance-based masked attention to prevent attribute leakage across instances. These components enable DEIG to generate visually coherent multi-instance scenes that precisely match rich, localized textual descriptions. To support fine-grained supervision, we construct a high-quality dataset with detailed, compositional instance captions generated by VLMs. We also introduce DEIG-Bench, a new benchmark with region-level annotations and multi-attribute prompts for both humans and objects. Experiments demonstrate that DEIG consistently outperforms existing approaches across multiple benchmarks in spatial consistency, semantic accuracy, and compositional generalization. Moreover, DEIG functions as a plug-and-play module, making it easily integrable into standard diffusion-based pipelines.
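
The instance-based masked attention described above can be illustrated with a small sketch. It is not the authors' implementation: the masking rule, the function names (`build_instance_mask`, `masked_softmax`), and the simplification that each image token belongs to at most one instance are all assumptions made for illustration. The attention map over the concatenated sequence [image tokens; instance tokens] splits into four sub-regions (as the Figure 3 caption describes), and the mask blocks every pair that would let one instance's tokens influence another instance's region.

```python
import math

def build_instance_mask(region_of_pixel, num_instances, tokens_per_instance):
    """Boolean mask (True = may attend) over [image tokens; instance tokens].

    region_of_pixel[i] is the (hypothetical) instance id whose box covers
    image token i, or -1 for background. Overlapping boxes are ignored here.
    """
    # Owner id of each instance token, e.g. [0, 0, 1, 1] for 2 instances x 2 tokens.
    owner = [i for i in range(num_instances) for _ in range(tokens_per_instance)]
    n_img, n_ins = len(region_of_pixel), len(owner)
    n = n_img + n_ins
    mask = [[False] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            if r < n_img and c < n_img:
                # Sub-region 1, image -> image: unrestricted (assumed).
                mask[r][c] = True
            elif r < n_img:
                # Sub-region 2, image -> instance: only the covering instance.
                mask[r][c] = region_of_pixel[r] == owner[c - n_img]
            elif c < n_img:
                # Sub-region 3, instance -> image: only pixels inside its own box.
                mask[r][c] = owner[r - n_img] == region_of_pixel[c]
            else:
                # Sub-region 4, instance -> instance: only tokens of the same instance.
                mask[r][c] = owner[r - n_img] == owner[c - n_img]
    return mask

def masked_softmax(scores, mask):
    """Row-wise softmax that assigns exactly zero weight to masked positions."""
    out = []
    for row, allowed in zip(scores, mask):
        # Every row keeps at least one entry (image rows see all image tokens,
        # instance rows see themselves), so max() is never called on an empty list.
        top = max(s for s, a in zip(row, allowed) if a)
        exps = [math.exp(s - top) if a else 0.0 for s, a in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Four image tokens (two in instance 0, one in instance 1, one background)
# and two instances with two detail tokens each.
mask = build_instance_mask([0, 0, 1, -1], num_instances=2, tokens_per_instance=2)
weights = masked_softmax([[0.0] * 8 for _ in range(8)], mask)
```

Under this assumed rule, an image token inside instance 0's box gets zero attention weight on instance 1's detail tokens, which is the leakage the mask is meant to prevent. Background tokens attend only to other image tokens.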

Paper Structure (22 sections, 8 equations, 12 figures, 4 tables)


Introduction
Related Work
Controllable Diffusion models
Multi-Instance Generation
Training-Free Methods
Training-Based Methods
Method
Overview
Instance-Level Semantic Enhancement
Instance Detail Extractor
Detail Fusion Module
Grounding Embeddings Broadcast
Instance-based Masked Attention
Detail-Enriched Instance Captions
Experiment
...and 7 more sections

Figures (12)

Figure 1: Fine-Grained Generation. Given bounding boxes and detailed descriptions, our method accurately generates multi-attribute instances, while existing methods fail to preserve fine-grained semantic details.
Figure 2: Overview of the DEIG pipeline. (a) DEIG enables the use of a frozen large text encoder to extract raw instance embeddings, which are refined by the IDE and fused into the UNet via the DFM for fine-grained control. (b) Structure of the IDE, which refines learnable queries via time-aware self- and cross-attention to produce compact instance embeddings.
Figure 3: Workflow Visualization of Fine-Grained Instance Generation. (a) The instance-based masked attention mechanism divides the attention map into four sub-regions and applies masks to restrict cross-instance interactions and prevent semantic leakage. (b) Visualization of aggregated semantic embeddings across different semantic dimensions.
Figure 4: Detail-Enriched caption generation pipeline. Instances are described by a VLM from cropped images and filtered using CLIP scores and human review.
Figure 5: Qualitative comparison on DEIG-Bench. DEIG exhibits accurate generation of fine-grained, multi-attribute instances across varying levels of complexity, demonstrating superior compositional control and semantic alignment.
...and 7 more figures
