MagicSeg: Open-World Segmentation Pretraining via Counterfactural Diffusion-Based Auto-Generation

Kaixin Cai, Pengzhen Ren, Jianhua Han, Yi Zhu, Hang Xu, Jianzhuang Liu, Xiaodan Liang

Abstract

Open-world semantic segmentation currently relies heavily on large image-text pair datasets, which typically lack fine-grained pixel annotations for a sufficient number of categories. Acquiring such annotations is prohibitively expensive in both human labor and time. Motivated by the strong image generation capabilities of diffusion models, we introduce MagicSeg, a diffusion-driven pipeline that automatically generates datasets tailored to open-world semantic segmentation. MagicSeg starts from class labels and generates high-fidelity textual descriptions, which then guide the diffusion model to generate images. Rather than generating only positive samples for each label, the pipeline simultaneously generates corresponding negative images, which serve as paired counterfactual samples for contrastive training. To provide a self-supervised signal for open-world segmentation pretraining, MagicSeg then combines an open-vocabulary detection model with an interactive segmentation model to extract object masks from the images, based on the provided category labels, as precise segmentation labels. Training a contrastive language-image pretraining model on our dataset with pseudo-mask supervision and auxiliary counterfactual contrastive training yields strong downstream performance on open-world semantic segmentation. Evaluated on PASCAL VOC, PASCAL Context, and COCO, our model achieves state-of-the-art performance of 62.9%, 26.7%, and 40.2%, respectively, demonstrating the dataset's effectiveness in enhancing open-world semantic segmentation. Project website: https://github.com/ckxhp/magicseg.
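
The idea of training with paired counterfactual samples can be illustrated with a small sketch. This is not the loss defined in the paper, whose equations are not reproduced here. It is a generic two-way InfoNCE-style objective, assuming that a text embedding should score higher against its positive image than against the paired negative (counterfactual) image. The function names, the use of cosine similarity, and the temperature value 0.07 are all assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def counterfactual_contrastive_loss(
    text: Sequence[float],
    pos_img: Sequence[float],
    neg_img: Sequence[float],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """Hypothetical pairwise loss: -log softmax of the positive logit.

    The text embedding is compared against one positive image embedding
    and one counterfactual (negative) image embedding.
    """
    s_pos = cosine(text, pos_img) / temperature
    s_neg = cosine(text, neg_img) / temperature
    # Numerically stable log-sum-exp over the two logits.
    m = max(s_pos, s_neg)
    log_denom = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_denom - s_pos


# Toy embeddings: the text matches the positive image and is orthogonal
# to the counterfactual one.
text_emb = [1.0, 0.0]
pos_emb = [1.0, 0.0]
neg_emb = [0.0, 1.0]

loss_correct = counterfactual_contrastive_loss(text_emb, pos_emb, neg_emb)
loss_swapped = counterfactual_contrastive_loss(text_emb, neg_emb, pos_emb)
loss_tied = counterfactual_contrastive_loss(text_emb, pos_emb, pos_emb)
```

In this toy setup the loss is close to zero when the text matches the positive image, large when the pair is swapped, and exactly log 2 when the two images are indistinguishable to the text embedding.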

Paper Structure (14 sections, 6 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: Comparison between previous methods and MagicSeg. Previous methods such as DiffuMask construct dedicated segmentation datasets for specific downstream tasks. In contrast, MagicSeg leverages a large-scale vocabulary to build a dataset for open-world segmentation. Additionally, we generate counterfactual images to assist the segmentation task.
  • Figure 2: Visualization of synthetic images and masks from MagicSeg. With ChatGPT, we generate richly descriptive texts from class names, which are used to create diverse images and corresponding masks.
  • Figure 3: The overall Counterfactual Diffusion-based Generation framework of MagicSeg: creating a large-scale open-world segmentation dataset with pixel-level annotations and counterfactual images for a wide range of categories.
  • Figure 4: The overall open-world segmentation model training framework of MagicSeg: through the category random sampling strategy and counterfactual contrastive training, we apply the constructed dataset to a CLIP-based segmentation model, enhancing its capability in open-world segmentation.
  • Figure 5: Visualization of the diversity of MagicSeg's dataset for the class “cat”. MagicSeg can generate diverse images for each category.
  • ...and 5 more figures