DEIG: Detail-Enhanced Instance Generation with Fine-Grained Semantic Control

Shiyan Du, Conghan Yue, Xinyu Cheng, Dongyu Zhang

TL;DR

DEIG is a framework for fine-grained, controllable multi-instance generation. It functions as a plug-and-play module, so it can be integrated easily into standard diffusion-based pipelines.

Abstract

Multi-Instance Generation has advanced significantly in spatial placement and attribute binding. However, existing approaches still face challenges in fine-grained semantic understanding, particularly when dealing with complex textual descriptions. To overcome these limitations, we propose DEIG, a novel framework for fine-grained and controllable multi-instance generation. DEIG integrates an Instance Detail Extractor (IDE) that transforms text encoder embeddings into compact, instance-aware representations, and a Detail Fusion Module (DFM) that applies instance-based masked attention to prevent attribute leakage across instances. These components enable DEIG to generate visually coherent multi-instance scenes that precisely match rich, localized textual descriptions. To support fine-grained supervision, we construct a high-quality dataset with detailed, compositional instance captions generated by VLMs. We also introduce DEIG-Bench, a new benchmark with region-level annotations and multi-attribute prompts for both humans and objects. Experiments demonstrate that DEIG consistently outperforms existing approaches across multiple benchmarks in spatial consistency, semantic accuracy, and compositional generalization. Moreover, DEIG functions as a plug-and-play module, making it easily integrable into standard diffusion-based pipelines.

Paper Structure

This paper contains 22 sections, 8 equations, 12 figures, and 4 tables.

Figures (12)

  • Figure 1: Fine-Grained Generation. Given bounding boxes and detailed descriptions, our method accurately generates multi-attribute instances, while existing methods fail to preserve fine-grained semantic details.
  • Figure 2: Overview of the DEIG pipeline. (a) DEIG enables the use of a frozen large text encoder to extract raw instance embeddings, which are refined by the IDE and fused into the UNet via the DFM for fine-grained control. (b) Structure of the IDE, which refines learnable queries via time-aware self- and cross-attention to produce compact instance embeddings.
  • Figure 3: Workflow Visualization of Fine-Grained Instance Generation. (a) The instance-based masked attention mechanism divides the attention map into four sub-regions and applies masks to restrict cross-instance interactions and prevent semantic leakage. (b) Visualization of aggregated semantic embeddings across different semantic dimensions.
  • Figure 4: Detail-enriched caption generation pipeline. Instances are described by a VLM from cropped images and filtered using CLIP scores and human review.
  • Figure 5: Qualitative comparison on DEIG-Bench. DEIG exhibits accurate generation of fine-grained, multi-attribute instances across varying levels of complexity, demonstrating superior compositional control and semantic alignment.
  • ...and 7 more figures
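
The instance-based masked attention described in the abstract and in Figure 3(a) can be illustrated with a small sketch. This is not the authors' implementation. The token layout (image tokens first, then a fixed number of tokens per instance), the rule applied in each of the four sub-regions, and the names `build_instance_mask` and `tokens_per_instance` are all assumptions made for illustration. The sketch only shows one plausible way a mask could keep each instance's tokens from interacting with other instances or with pixels outside its bounding box.

```python
def build_instance_mask(height, width, boxes, tokens_per_instance):
    """Return a boolean attention mask where True means attention is allowed.

    Assumed token order (hypothetical, not from the paper): all image tokens
    in row-major order, followed by `tokens_per_instance` tokens per instance.
    Each box is (x0, y0, x1, y1) on the latent grid, with x1 and y1 exclusive.
    """
    num_image = height * width
    total = num_image + len(boxes) * tokens_per_instance

    def inside(pixel, box):
        # Convert a flat image-token index back to grid coordinates.
        y, x = divmod(pixel, width)
        x0, y0, x1, y1 = box
        return x0 <= x < x1 and y0 <= y < y1

    def owner(token):
        # Index of the instance that an instance token belongs to.
        return (token - num_image) // tokens_per_instance

    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            q_is_image = q < num_image
            k_is_image = k < num_image
            if q_is_image and k_is_image:
                # Sub-region 1 (assumed): image tokens attend to each other freely.
                allowed = True
            elif q_is_image:
                # Sub-region 2 (assumed): a pixel sees only instances whose box covers it.
                allowed = inside(q, boxes[owner(k)])
            elif k_is_image:
                # Sub-region 3 (assumed): an instance token sees only pixels in its own box.
                allowed = inside(k, boxes[owner(q)])
            else:
                # Sub-region 4 (assumed): instance tokens never attend across instances.
                allowed = owner(q) == owner(k)
            mask[q][k] = allowed
    return mask


# Usage: a 2x2 grid with one instance in the left column and one in the right.
mask = build_instance_mask(2, 2, [(0, 0, 1, 2), (1, 0, 2, 2)], tokens_per_instance=1)
```

In this toy layout, the top-left pixel (token 0) may attend to the first instance token (token 4) but not the second (token 5), and the two instance tokens cannot attend to each other. Under these assumptions, that separation is what would prevent attributes of one instance from leaking into another.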