InstanceDiffusion: Instance-level Control for Image Generation

Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar, Ishan Misra

TL;DR

InstanceDiffusion addresses the lack of fine-grained instance control in text-to-image diffusion by introducing UniFusion for unified instance conditioning, ScaleU for fidelity to layouts, and Multi-instance Sampler to reduce cross-instance leakage. It supports multiple location formats (points, scribbles, boxes, masks) and per-instance captions, enabling precise, flexible scene composition. The approach achieves state-of-the-art performance on COCO and LVIS across several conditioned inputs and demonstrates strong attribute binding, as well as iterative editing capabilities. The work provides a practical framework for instance-level controllable generation with broad applicability in design and data synthesis.

Abstract

Text-to-image diffusion models produce high quality images but do not offer control over individual instances in the image. We introduce InstanceDiffusion that adds precise instance-level control to text-to-image diffusion models. InstanceDiffusion supports free-form language conditions per instance and allows flexible ways to specify instance locations such as simple single points, scribbles, bounding boxes or intricate instance segmentation masks, and combinations thereof. We propose three major changes to text-to-image models that enable precise instance-level control. Our UniFusion block enables instance-level conditions for text-to-image models, the ScaleU block improves image fidelity, and our Multi-instance Sampler improves generations for multiple instances. InstanceDiffusion significantly surpasses specialized state-of-the-art models for each location condition. Notably, on the COCO dataset, we outperform previous state-of-the-art by 20.4% AP$_{50}^\text{box}$ for box inputs, and 25.4% IoU for mask inputs.

Paper Structure (18 sections, 5 equations, 13 figures, 12 tables)

Figures (13)

  • Figure 1: InstanceDiffusion enhances text-to-image models by providing additional instance-level control. In addition to a global text prompt, InstanceDiffusion allows paired instance-level prompts and their locations to be specified when generating images. InstanceDiffusion is versatile, supporting a range of location forms, from the simplest points, boxes, and scribbles to more complex masks, and their flexible combinations.
  • Figure 2: UniFusion projects various forms of instance-level conditions into the same feature space, seamlessly incorporating instance-level locations and text-prompts into the visual tokens from the diffusion backbone.
  • Figure 3: We represent different location condition formats as sets of points, with each format having a different number of points. Masks are represented as sparsely sampled points within the mask plus uniformly sampled points from boundary polygons, bounding boxes by their top-left and bottom-right corners, and scribbles are converted into uniformly sampled points.
  • Figure 4: Model inference with Multi-instance Sampler to minimize information leakage across multiple instance conditionings.
  • Figure 5: Qualitative comparison of InstanceDiffusion vs. GLIGEN conditioned on multiple instance boxes and prompts. Prior work (bottom row) fails to accurately reflect specific instance attributes, e.g., the colors of the flower and puppies on the left, and fails to depict a waterfall on the right. Its generations also do not capture the correct instances and are prone to information leakage across the instance prompts, e.g., generating two similar instances on the right. InstanceDiffusion effectively mitigates these issues.
  • ...and 8 more figures
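
The point-set representation described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names (box_to_points, resample_polyline, mask_to_points), the default point counts, and the use of arc-length resampling and random interior sampling are all assumptions chosen for illustration. It only shows how a box, a scribble, and a mask could each be reduced to a list of (x, y) points.

```python
import math
import random


def box_to_points(box):
    """Hypothetical helper: a box (x0, y0, x1, y1) becomes its
    top-left and bottom-right corners."""
    x0, y0, x1, y1 = box
    return [(x0, y0), (x1, y1)]


def resample_polyline(points, n_points):
    """Hypothetical helper: return n_points (>= 2) spaced evenly by arc
    length along a polyline with at least two vertices (e.g. a scribble)."""
    # Cumulative arc length at each vertex.
    cum = [0.0]
    for (xa, ya), (xb, yb) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(xb - xa, yb - ya))
    out = []
    seg = 0
    for k in range(n_points):
        target = cum[-1] * k / (n_points - 1)
        # Advance to the segment that contains the target arc length.
        while seg < len(points) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        t = 0.0 if span == 0 else (target - cum[seg]) / span
        (xa, ya), (xb, yb) = points[seg], points[seg + 1]
        out.append((xa + t * (xb - xa), ya + t * (yb - ya)))
    return out


def mask_to_points(mask, polygon, n_inside=8, n_boundary=8, seed=0):
    """Hypothetical helper: sparse random points inside a binary mask
    (list of rows of 0/1) plus evenly spaced points on its boundary polygon."""
    inside = [(x, y) for y, row in enumerate(mask)
              for x, v in enumerate(row) if v]
    rng = random.Random(seed)
    sampled = rng.sample(inside, min(n_inside, len(inside)))
    # Close the polygon, resample, and drop the repeated start point.
    closed = list(polygon) + [polygon[0]]
    boundary = resample_polyline(closed, n_boundary + 1)[:-1]
    return sampled + boundary


if __name__ == "__main__":
    print(box_to_points((10, 20, 110, 220)))
    print(resample_polyline([(0, 0), (10, 0)], 5))
```

Under this sketch, each location format yields a different number of points (two for a box, a configurable count for scribbles and masks), which matches the caption's remark that the formats differ in how many points they use.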