MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation

Jiahao Xie; Wei Li; Xiangtai Li; Ziwei Liu; Yew Soon Ong; Chen Change Loy

TL;DR

Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion can significantly improve the performance of existing instance segmentation models, especially for rare and novel categories.

Abstract

We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation. Our method is training-free and does not rely on any label supervision. Two key designs enable us to employ an off-the-shelf text-to-image diffusion model as a useful dataset generator for object instances and mask annotations. First, we divide an image canvas into several regions and run a single round of the diffusion process to generate multiple instances simultaneously, conditioning each region on a different text prompt. Second, we obtain the corresponding instance masks by aggregating the cross-attention maps associated with the object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement. Without bells and whistles, MosaicFusion can produce a significant amount of synthetic labeled data for both rare and novel categories. Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion can significantly improve the performance of existing instance segmentation models, especially for rare and novel categories. Code: https://github.com/Jiahao000/MosaicFusion.
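The first design, dividing a canvas into regions that are each conditioned on their own prompt, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Region` type, the `split_canvas` helper, the even split at the canvas center, and the prompt template are all assumptions made for illustration, and the diffusion model itself is not invoked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical container: a pixel box (x0, y0, x1, y1) plus its text prompt."""
    x0: int
    y0: int
    x1: int
    y1: int
    prompt: str


def split_canvas(width, height, categories, template="a photo of a single {}"):
    """Split a width x height canvas into 1, 2, or 4 regions, one per category.

    The even split and the prompt template are illustrative assumptions.
    """
    n = len(categories)
    cx, cy = width // 2, height // 2
    if n == 1:
        boxes = [(0, 0, width, height)]
    elif n == 2:
        boxes = [(0, 0, cx, height), (cx, 0, width, height)]
    elif n == 4:
        boxes = [
            (0, 0, cx, cy), (cx, 0, width, cy),
            (0, cy, cx, height), (cx, cy, width, height),
        ]
    else:
        raise ValueError("this sketch only supports 1, 2, or 4 regions")
    # Pair each box with the prompt built from its category name.
    return [Region(*box, prompt=template.format(cat))
            for box, cat in zip(boxes, categories)]


regions = split_canvas(768, 512, ["spice rack", "chessboard", "legume", "tinsel"])
for r in regions:
    print((r.x0, r.y0, r.x1, r.y1), r.prompt)
```

Each region would then be denoised with the shared noise prediction model under its own prompt; that step depends on the specific diffusion library and is left out here.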

Paper Structure (17 sections, 1 equation, 5 figures, 13 tables)

This paper contains 17 sections, 1 equation, 5 figures, 13 tables.

Introduction
Related Work
MosaicFusion
Preliminary
Image Generation
Mask Generation
Experiments
Implementation Details
Datasets
Evaluation Metrics
MosaicFusion
Baseline Settings
Main Properties
Comparison with Previous Methods
Further Discussion
...and 2 more sections

Figures (5)

Figure 1: Long-tailed and open-vocabulary instance segmentation on LVIS [gupta2019lvis] using our MosaicFusion data augmentation approach, which generates meaningful synthetic labeled data for both rare and novel categories without further training or label supervision. We evaluate the models with the standard mask AP (i.e., AP$_{\text{r}}$ and AP$_{\text{novel}}$). MosaicFusion provides strong gains on all considered baseline methods (e.g., Mask R-CNN [he2017mask] with ResNet-50, Box-Supervised CenterNet2 [detic] with Swin-B, and F-VLM [kuo2023f] with ResNet-50 and ResNet-50x64).
Figure 2: Overview of our MosaicFusion pipeline. The left part shows the image generation process, and the right part shows the mask generation process. Given a user-defined Mosaic image canvas and a set of text prompts, we first map the image canvas from the pixel space into the latent space. We then run the diffusion process on all latent regions in parallel with the shared noise prediction model, starting from the same initialization noise while conditioning each region on a different text prompt, to generate a synthetic image with the specified object in each region. Simultaneously, we aggregate the region-wise cross-attention maps for each subject token by upscaling them to the original region size in the pixel space and averaging them across all attention heads, layers, and time steps. Finally, we binarize the aggregated attention maps, refine the boundaries, filter out the low-quality masks, and expand the remaining masks to the size of the whole image canvas to obtain the final instance masks.
Figure 3: Visualization of the cross-attention maps for each subject word of interest across different time steps and layers in the diffusion process. The time steps range from the first step $t=50$ to the last step $t=1$ in equal intervals (from left to right), while the layer resolutions range from $\times 1/32$ to $\times 1/8$ of the original image size (from top to bottom). In each entry, the last column shows the cross-attention maps averaged across time steps, while the last row shows the cross-attention maps averaged across layers. The highest-quality attention maps are produced by averaging across both time steps and layers (bottom right, framed in red).
Figure 4: Visualization of the instance segmentation dataset synthesized by MosaicFusion. We show examples of generating $N=1,2,4$ objects per image, using the settings in Table \ref{tab:ablation-lt-object}.
Figure 5: Visualization of failure cases during synthesis. MosaicFusion fails on some objects that are i) entangled (e.g., "spice rack" with spices, "chessboard" with chess), ii) small (e.g., "legume", "tinsel"), and iii) abstract (e.g., "hardback book").
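The mask generation steps described in the Figure 2 and Figure 3 captions (upscale, average across heads, layers, and time steps, binarize, expand to the canvas) can be sketched roughly as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the attention maps are assumed to arrive as NumPy arrays of shape `(heads, h, w)`, nearest-neighbor upscaling and a fixed threshold of 0.5 on the min-max normalized map stand in for whatever the paper actually uses, and the boundary refinement and mask filtering steps are omitted.

```python
import numpy as np


def upscale_nearest(attn, out_h, out_w):
    """Nearest-neighbor upscale of a 2D map; assumes integer scale factors."""
    h, w = attn.shape
    if out_h % h or out_w % w:
        raise ValueError("region size must be a multiple of the map size")
    return np.repeat(np.repeat(attn, out_h // h, axis=0), out_w // w, axis=1)


def aggregate_attention(maps, out_h, out_w):
    """Average cross-attention maps over heads, then over layers and time steps.

    `maps` is a list with one (heads, h, w) array per (layer, time step) pair;
    resolutions may differ between entries.
    """
    upscaled = [upscale_nearest(m.mean(axis=0), out_h, out_w) for m in maps]
    return np.mean(upscaled, axis=0)


def region_mask_to_canvas(agg, box, canvas_hw, threshold=0.5):
    """Binarize an aggregated map and paste it into a full-canvas boolean mask.

    The fixed threshold is an assumption made for this sketch.
    """
    x0, y0, x1, y1 = box
    lo, hi = agg.min(), agg.max()
    norm = (agg - lo) / (hi - lo) if hi > lo else np.zeros_like(agg)
    canvas = np.zeros(canvas_hw, dtype=bool)
    canvas[y0:y1, x0:x1] = norm > threshold
    return canvas


# Toy example: two maps at different resolutions that both attend to the
# top-left quarter of an 8x8 region placed in the top-right of a 16x16 canvas.
coarse = np.zeros((2, 2, 2))
coarse[:, 0, 0] = 1.0
fine = np.zeros((2, 4, 4))
fine[:, :2, :2] = 1.0
agg = aggregate_attention([coarse, fine], 8, 8)
mask = region_mask_to_canvas(agg, (8, 0, 16, 8), (16, 16))
print(mask.sum())
```

Pasting the region mask into an all-zero canvas is what makes each instance mask valid for the whole image even though it was computed from a single region.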
