ROICtrl: Boosting Instance Control for Visual Generation

Yuchao Gu, Yipin Zhou, Yunfan Ye, Yixin Nie, Licheng Yu, Pingchuan Ma, Kevin Qinghong Lin, Mike Zheng Shou

TL;DR

This work enhances diffusion models by introducing regional instance control, where each instance is governed by a bounding box paired with a free-form caption, and proposes ROICtrl, an adapter for pretrained diffusion models that enables precise regional instance control.

Abstract

Natural language often struggles to accurately associate positional and attribute information with multiple instances, which limits current text-based visual generation models to simpler compositions featuring only a few dominant instances. To address this limitation, this work enhances diffusion models by introducing regional instance control, where each instance is governed by a bounding box paired with a free-form caption. Previous methods in this area typically rely on implicit position encoding or explicit attention masks to separate regions of interest (ROIs), resulting in either inaccurate coordinate injection or large computational overhead. Inspired by ROI-Align in object detection, we introduce a complementary operation called ROI-Unpool. Together, ROI-Align and ROI-Unpool enable explicit, efficient, and accurate ROI manipulation on high-resolution feature maps for visual generation. Building on ROI-Unpool, we propose ROICtrl, an adapter for pretrained diffusion models that enables precise regional instance control. ROICtrl is compatible with community-finetuned diffusion models, as well as with existing spatial-based add-ons (e.g., ControlNet, T2I-Adapter) and embedding-based add-ons (e.g., IP-Adapter, ED-LoRA), extending their applications to multi-instance generation. Experiments show that ROICtrl achieves superior performance in regional instance control while significantly reducing computational costs.
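
To make the two operations named in the abstract more concrete, here is a minimal single-channel sketch in pure Python. It is not the paper's implementation. The function names (`bilinear`, `roi_align`, `roi_unpool`), the one-sample-per-bin choice, the pixel-center convention, and the zero fill outside the box are all assumptions made for illustration. The point it tries to show is that box coordinates stay continuous throughout: features are read and written by bilinear interpolation rather than by rounding the box to integer cells.

```python
import math

def bilinear(grid, y, x):
    """Sample a 2D grid (list of rows) at continuous index (y, x), clamped to the border."""
    h, w = len(grid), len(grid[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = grid[y0][x0] * (1 - dx) + grid[y0][x1] * dx
    bottom = grid[y1][x0] * (1 - dx) + grid[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feat, box, out_h, out_w):
    """Crop a continuous box (x1, y1, x2, y2) into a fixed out_h x out_w grid.

    Assumption: one bilinear sample at each bin center; cell i of `feat`
    covers [i, i + 1), so continuous coordinate c maps to index c - 0.5.
    """
    x1, y1, x2, y2 = box
    bin_h, bin_w = (y2 - y1) / out_h, (x2 - x1) / out_w
    return [[bilinear(feat,
                      y1 + (i + 0.5) * bin_h - 0.5,
                      x1 + (j + 0.5) * bin_w - 0.5)
             for j in range(out_w)]
            for i in range(out_h)]

def roi_unpool(roi, box, h, w):
    """Paste a fixed-size ROI grid back into an h x w map at the box location.

    Each cell whose center lies inside the box reads the ROI grid at its
    continuous relative position; cells outside the box are left at zero
    (an assumption of this sketch).
    """
    x1, y1, x2, y2 = box
    rh, rw = len(roi), len(roi[0])
    bin_h, bin_w = (y2 - y1) / rh, (x2 - x1) / rw
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            cy, cx = i + 0.5, j + 0.5  # cell center in continuous coordinates
            if y1 <= cy <= y2 and x1 <= cx <= x2:
                out[i][j] = bilinear(roi,
                                     (cy - y1) / bin_h - 0.5,
                                     (cx - x1) / bin_w - 0.5)
    return out

# Usage: crop a fractional box to 4 x 4, then write it back at the same place.
feat = [[float(i * 8 + j) for j in range(8)] for i in range(8)]
roi = roi_align(feat, (1.3, 2.6, 5.9, 6.2), 4, 4)
restored = roi_unpool(roi, (1.3, 2.6, 5.9, 6.2), 8, 8)
```

In this sketch an instance of any box size is processed at one fixed ROI resolution and then mapped back to its region, which is one plausible reading of how the pair avoids both coordinate rounding and full-resolution attention masks.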

Paper Structure

This paper contains 32 sections, 1 equation, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Grid test for instance control. (a) We structure the region positions and instance captions into a single plain caption, then prompt DALL-E 3 to generate a nine-grid image. (b) We apply ROICtrl to generate a nine-grid image based on instance captions.
  • Figure 2: Applications of ROICtrl. A trained ROICtrl adapter can extend existing diffusion models (a) and their community-finetuned versions (b) to multi-instance generation. Additionally, it can collaborate with spatial-based add-ons (c) and embedding-based add-ons (d, e) to offer fine-grained control over spatial or identity information. ROICtrl can also be applied to continuous generation settings (f). Due to legal considerations, we do not display customized results involving human identity.
  • Figure 3: Illustration of different ROI injection designs. $\lfloor\cdot\rceil$ denotes coordinate quantization to the nearest integer.
  • Figure 4: Illustration of ROI-Unpool. The dashed grid represents the spatial features, while the solid grid represents the ROI features. Similar to ROI-Align (He et al., 2017), ROI-Unpool avoids coordinate quantization during computation.
  • Figure 5: Detailed structure of ROICtrl. In parallel with the pretrained global caption injection, we introduce an additional instance caption injection. The global attention output and instance attention output are then fused using learnable blending.
  • ...and 7 more figures
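
The Figure 5 caption mentions that the global and instance attention outputs are fused using learnable blending. The exact formulation is not given in this summary, so the sketch below shows only one simple possibility: a single scalar gate, passed through a sigmoid, that mixes the two outputs inside the ROI and leaves the global output untouched elsewhere. The function `blend`, the parameter `gate_logit`, and the binary `roi_mask` are hypothetical names, not taken from the paper.

```python
import math

def blend(global_out, instance_out, roi_mask, gate_logit):
    """Convex blend of two 2D maps (hypothetical scalar-gate variant).

    alpha = sigmoid(gate_logit) would be the learnable part; roi_mask is 1
    where an instance box covers the location and 0 elsewhere.
    """
    alpha = 1.0 / (1.0 + math.exp(-gate_logit))
    return [[g + m * alpha * (r - g)
             for g, r, m in zip(g_row, r_row, m_row)]
            for g_row, r_row, m_row in zip(global_out, instance_out, roi_mask)]

# Usage: with gate_logit = 0 the gate is 0.5, so covered cells are averaged.
fused = blend([[2.0, 2.0]], [[4.0, 4.0]], [[1, 0]], 0.0)
```

In this toy form, a very negative gate leaves the pretrained global path essentially unchanged while a large positive gate lets the instance caption dominate inside its box. Whether the actual blending weights are scalar, per-location, or per-instance is not stated here.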