MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

Dewei Zhou; You Li; Fan Ma; Xiaoting Zhang; Yi Yang

TL;DR

This work defines the Multi-Instance Generation (MIG) task and introduces MIGC, a divide-and-conquer framework that decomposes MIG into per-instance shading subtasks, strengthens each shading step with an Enhancement Attention layer, and harmonizes the results via Layout Attention and a Shading Aggregation Controller. It also provides COCO-MIG, a benchmark that assesses position, quantity, and attribute control, and reports substantial gains in Instance Success Rate and spatial accuracy on COCO-MIG, COCO-Position, and DrawBench, with inference speed close to that of Stable Diffusion. Key technical contributions include grounded phrase tokens for disambiguating instances, Enhancement Attention to mitigate missing instances, and a dynamic shading aggregation strategy that preserves image coherence. The results indicate that MIGC significantly improves multi-instance control while maintaining image quality, which suggests practical applicability to complex scene generation with precise object-level constraints.
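To make the Instance Success Rate mentioned above concrete, here is a minimal sketch of how such a metric could be computed. It is an illustration under stated assumptions, not the COCO-MIG evaluation pipeline: the `Instance` type, the `iou` and `instance_success_rate` helpers, the 0.5 IoU threshold, and the exact-match test on label and color are all hypothetical choices, and the detections are assumed to come from some external detector that this summary does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical record for a requested or detected instance."""
    box: tuple   # (x0, y0, x1, y1), normalized to [0, 1]
    label: str   # e.g. "donut"
    color: str   # e.g. "red"


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def instance_success_rate(targets, detections, iou_threshold=0.5):
    """Fraction of requested instances matched by some detection.

    Assumption: an instance counts as a success when a detection has the
    same label and color and overlaps the requested box by at least
    `iou_threshold`.
    """
    if not targets:
        return 0.0
    hits = 0
    for target in targets:
        matched = any(
            det.label == target.label
            and det.color == target.color
            and iou(target.box, det.box) >= iou_threshold
            for det in detections
        )
        hits += int(matched)
    return hits / len(targets)
```

In this sketch, a donut generated in the right place but with the wrong color counts as a failure, which mirrors the kind of attribute error the benchmark is meant to expose.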

Abstract

We present the Multi-Instance Generation (MIG) task: simultaneously generating multiple instances with diverse controls in one image. Given a set of predefined coordinates and their corresponding descriptions, the task is to ensure that the generated instances are placed accurately at the designated locations and that every instance's attributes adhere to its description. This broadens the scope of current research on single-instance generation, elevating it to a more versatile and practical dimension. Inspired by the idea of divide and conquer, we introduce an approach named Multi-Instance Generation Controller (MIGC) to address the challenges of the MIG task. First, we break the MIG task down into several subtasks, each involving the shading of a single instance. To ensure precise shading for each instance, we introduce an instance enhancement attention mechanism. Finally, we aggregate all the shaded instances to provide the information needed to generate multiple instances accurately in Stable Diffusion (SD). To evaluate how well generation models perform on the MIG task, we provide the COCO-MIG benchmark along with an evaluation pipeline. Extensive experiments were conducted on the proposed COCO-MIG benchmark as well as on several commonly used benchmarks. The evaluation results illustrate the strong control capabilities of our model in terms of quantity, position, attribute, and interaction. Code and demos will be released at https://migcproject.github.io/.
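The task input and the "divide" step described in the abstract can be sketched as follows. This is only an illustration: `InstanceSpec`, `box_mask`, and `divide` are hypothetical names, the boxes are assumed to be normalized `(x0, y0, x1, y1)` coordinates, and each subtask is represented simply as a description paired with a binary mask on a feature grid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSpec:
    """One requested instance: a description and where it should appear."""
    description: str  # e.g. "a red donut"
    box: tuple        # assumed normalized (x0, y0, x1, y1) in [0, 1]


def box_mask(box, height, width):
    """Binary mask (list of rows) that is 1 where a cell center lies in the box."""
    x0, y0, x1, y1 = box
    return [
        [
            1 if (x0 <= (col + 0.5) / width < x1
                  and y0 <= (row + 0.5) / height < y1) else 0
            for col in range(width)
        ]
        for row in range(height)
    ]


def divide(instances, height, width):
    """Split a multi-instance request into single-instance shading subtasks."""
    return [
        {"prompt": inst.description, "mask": box_mask(inst.box, height, width)}
        for inst in instances
    ]
```

Each returned subtask carries only its own description and region, which is the property that lets a single-instance shading step ignore the other instances' attributes.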

Paper Structure (31 sections, 11 equations, 16 figures, 5 tables)

Introduction
Related work
Text-to-Image Generation
Layout-to-Image Generation
Method
Preliminaries
Overview
Divide MIG into Instance Shading Subtasks
Conquer Instance Shading
Combine Shading Results
Summary
Experiments
Benchmarks
Evaluation Metrics
Baselines
...and 16 more sections

Figures (16)

Figure 1: An example from the COCO-MIG benchmark. (a) In this example, COCO-MIG requires generation models to generate "donuts" of various colors according to the specified positions and color attributes. (b) Although the state-of-the-art layout-to-image method GLIGEN can generate "donuts" at the specified positions in this example, their color attributes are incorrect. We use boxes labeled "Attr" to mark the wrong color attributes. (c) Our proposed MIGC not only generates "donuts" at the positions specified by the annotation but also ensures that the color attribute of each generated donut instance is correct.
Figure 2: Overview of our MIGC. Stable Diffusion's UNet feeds the text description and the image features into the Cross-Attention layer to obtain a residual feature, then adds this residual to the image features to determine the generated content. This resembles a shading process (i.e., coloring with parallel pencil lines or a block of color). In this view, MIG can be considered multi-instance shading on image features, and MIGC comprises three steps: (a) Divide MIG into single-instance shading subtasks. (b) Conquer single-instance shading with Enhancement Attention. (c) Combine shading results through Layout Attention and the Shading Aggregation Controller.
Figure 2: Multi-Instance Generation (MIG) with our MIGC. MIGC can generate images based on various complex layouts and ensure that the attributes of each instance are correct.
Figure 3: Three main modules in MIGC. (a) Architecture of Enhancement Attention Layer. (b) Architecture of Layout Attention Layer. (c) Architecture of Shading Aggregation Controller.
Figure 3: Multi-Instance Generation (MIG) with our MIGC. By specifying the relation between instances through the global prompt, MIGC can further control the interaction of instances.
...and 11 more figures
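The shading view in the overview caption (a residual added to the image features, with per-instance results combined afterwards) can be sketched in a few lines. This is a toy illustration and not the Shading Aggregation Controller itself: features are flattened to one scalar per spatial position, `aggregate_shading` is a hypothetical name, and the fixed per-instance `instance_scores` stand in for whatever weights a learned controller would predict.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_shading(features, global_residual, instance_residuals,
                      instance_masks, instance_scores):
    """Add a weighted mix of shading residuals to the image features.

    Assumptions: every argument is flattened over spatial positions; the
    global residual contributes everywhere with score 0.0; an instance
    contributes only where its mask is 1.
    """
    shaded = []
    for pos, feat in enumerate(features):
        residuals = [global_residual[pos]]
        scores = [0.0]
        for res, mask, score in zip(instance_residuals, instance_masks,
                                    instance_scores):
            if mask[pos]:
                residuals.append(res[pos])
                scores.append(score)
        weights = softmax(scores)
        mixed = sum(w * r for w, r in zip(weights, residuals))
        shaded.append(feat + mixed)  # residual connection: feature + shading
    return shaded
```

Positions outside every instance box fall back to the global residual alone, while positions inside a box blend the global and instance residuals, which is one simple way to keep the combined result coherent across region boundaries.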
