MIGC++: Advanced Multi-Instance Generation Controller for Image Synthesis

Dewei Zhou, You Li, Fan Ma, Zongxin Yang, Yi Yang

TL;DR

The paper defines the Multi-Instance Generation (MIG) task, addressing the challenges of attribute leakage, limited instance-description modalities, and iterative consistency in generating multiple, precisely placed objects within a single image. It introduces MIGC, a divide-and-conquer controller that shades each instance separately and merges the results to prevent leakage, and MIGC++, which adds multimodal attribute control (text and image) and fine-grained position control (boxes and masks) via Multimodal Enhance Attention and a Refined Shader. To enhance iterative MIG, the Consistent-MIG algorithm preserves unmodified regions and maintains instance identity across edits. The authors validate on the COCO-MIG and Multimodal-MIG benchmarks, showing substantial gains in ISR, MIoU, AP, and text-image alignment compared with state-of-the-art baselines, and demonstrate robustness across varying instance counts and modalities.

Abstract

We introduce the Multi-Instance Generation (MIG) task, which focuses on generating multiple instances within a single image, each accurately placed at predefined positions with attributes such as category, color, and shape, strictly following user specifications. MIG faces three main challenges: avoiding attribute leakage between instances, supporting diverse instance descriptions, and maintaining consistency in iterative generation. To address attribute leakage, we propose the Multi-Instance Generation Controller (MIGC). MIGC generates multiple instances through a divide-and-conquer strategy, breaking down multi-instance shading into single-instance tasks with singular attributes, later integrated. To provide more types of instance descriptions, we developed MIGC++. MIGC++ allows attribute control through text and images and position control through boxes and masks. Lastly, we introduced the Consistent-MIG algorithm to enhance the iterative MIG ability of MIGC and MIGC++. This algorithm ensures consistency in unmodified regions during the addition, deletion, or modification of instances, and preserves the identity of instances when their attributes are changed. We introduce the COCO-MIG and Multimodal-MIG benchmarks to evaluate these methods. Extensive experiments on these benchmarks, along with the COCO-Position benchmark and DrawBench, demonstrate that our methods substantially outperform existing techniques, maintaining precise control over aspects including position, attribute, and quantity. Project page: https://github.com/limuloo/MIGC.

Paper Structure

This paper contains 27 sections, 15 equations, 20 figures, and 9 tables.

Figures (20)

  • Figure 1: Illustration of MIG. (a) SD generates images from a single image description, struggling with position control (e.g., locating a missing dog) and attribute control (e.g., incorrect hat color) in MIG. (b) MIGC ensures precise attribute and positional fidelity by using bounding boxes for spatial definitions and text for attribute definitions. (c) MIGC++ extends the framework's versatility, integrating both textual and visual descriptors for attributes and employing bounding boxes and masks to define positions. (d) Building on MIGC and MIGC++, we introduce the Consistent-MIG algorithm to bolster iterative MIG capabilities.
  • Figure 2: Comparison of MIGC and MIGC++. (a) MIGC incorporates Instance Shaders in the U-Net's mid-block and deep up-blocks during high-noise sampling to ensure positional and coarse attribute control. (b) In addition to allowing more formats for describing instances (see Fig. 1(c)), MIGC++ introduces training-free Refined Shaders that supplant the Cross-Attention layers, enhancing accuracy in fine-grained details (e.g., better "banana" details).
  • Figure 3: Performance vs. Training Parameters. Our MIGC and MIGC++ outperformed all competitors and required the fewest parameters among methods that need training.
  • Figure 5: Enhance Attention enhances text embeddings into grounding embeddings, using a trainable Cross-Attention layer for single-instance shading (a). The original MIGC approach can only describe an instance with text and bounding boxes. Building on this, MIGC++ expands the framework to Multimodal Enhance Attention, enabling an instance to be described with various modalities within one generation (b).
  • Figure 6: Layout Attention operates akin to a Self-Attention mechanism but incorporates a layout constraint. This restriction ensures that each image token only attends to others located within the same instance region.
  • ...and 15 more figures
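
The layout constraint described in the Figure 6 caption can be illustrated with a small masked-attention sketch. This is not the authors' implementation: the function names `layout_mask` and `layout_attention` are hypothetical, and the sketch assumes each image token carries exactly one region label (for example, 0 for background and 1, 2, ... for instances), which is a simplification of whatever region handling the paper actually uses.

```python
import math


def layout_mask(region_ids):
    """Return allowed[i][j], True when tokens i and j share a region label.

    region_ids is a hypothetical flat list with one label per image token.
    """
    return [[a == b for b in region_ids] for a in region_ids]


def layout_attention(queries, keys, values, region_ids):
    """Scaled dot-product attention restricted to same-region tokens."""
    allowed = layout_mask(region_ids)
    dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Keep only the keys this token may attend to. A token always
        # shares a region with itself, so idx is never empty.
        idx = [j for j in range(len(keys)) if allowed[i][j]]
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in idx
        ]
        # Numerically stable softmax over the permitted keys only.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out = [
            sum(w / total * values[j][d] for w, j in zip(weights, idx))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

With region labels `[1, 1, 2, 2]`, the first two tokens mix only each other's values and never see the last two, which is the restriction the caption describes. A practical implementation would more likely apply the same mask as additive negative infinity on a batched attention matrix; the loop form here is chosen only for readability.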