SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis

Hanrong Ye, Jason Kuen, Qing Liu, Zhe Lin, Brian Price, Dan Xu

TL;DR

SegGen tackles the critical bottleneck of limited segmentation data by reversing the traditional data-generation pipeline: it first learns to synthesize segmentation masks from text (Text2Mask) and then generates images conditioned on those masks (Mask2Img). Two complementary strategies, MaskSyn (diverse masks and images) and ImgSyn (diverse images for real masks), enable large-scale, high-quality synthetic data without segmentation-labeler modules. Across ADE20K and COCO, SegGen yields consistent gains on semantic, panoptic, and instance segmentation and improves robustness to unseen domains, including when trained purely on synthetic data. These results demonstrate that high-quality synthetic data can approach real-data performance, reducing annotation costs and enabling better generalization in real-world scenarios.

Abstract

We propose SegGen, a highly effective training data generation method for image segmentation that pushes the performance limits of state-of-the-art segmentation models. SegGen integrates two data generation strategies, MaskSyn and ImgSyn. (i) MaskSyn synthesizes new mask-image pairs via our proposed text-to-mask generation model and mask-to-image generation model, greatly improving the diversity of segmentation masks for model supervision; (ii) ImgSyn synthesizes new images from existing masks using the mask-to-image generation model, strongly improving image diversity for model inputs. On the highly competitive ADE20K and COCO benchmarks, our data generation method markedly improves the performance of state-of-the-art segmentation models in semantic, panoptic, and instance segmentation. Notably, in terms of ADE20K mIoU, Mask2Former R50 improves from 47.2 to 49.9 (+2.7), and Mask2Former Swin-L improves from 56.1 to 57.4 (+1.3). These results suggest that SegGen is effective even when abundant human-annotated training data is available. Moreover, training with our synthetic data makes segmentation models more robust to unseen domains. Project website: https://seggenerator.github.io
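
The abstract reports its headline gains in mIoU (mean intersection-over-union). As a rough illustration of what that metric measures, the following is a minimal sketch, not the paper's evaluation code. It assumes predictions and labels are given as flat lists of integer class ids and that an `ignore_index` of 255 marks unlabeled pixels; the function name `mean_iou` and both conventions are assumptions made for this example. Benchmark implementations typically accumulate intersections and unions over the whole validation set rather than over a single image.

```python
from typing import Sequence


def mean_iou(
    pred: Sequence[int],
    target: Sequence[int],
    num_classes: int,
    ignore_index: int = 255,  # assumed convention for unlabeled pixels
) -> float:
    """Mean IoU over classes that occur in the prediction or the target.

    Assumes every predicted id lies in [0, num_classes).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixels do not count toward any class
        if p == t:
            # A correct pixel is in both the intersection and the union.
            intersection[t] += 1
            union[t] += 1
        else:
            # A wrong pixel enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: four pixels, two classes.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so `score` is their average. Scores such as 47.2 or 49.9 are this quantity expressed as a percentage.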

Paper Structure (21 sections, 2 equations, 15 figures, 8 tables)

Figures (15)

  • Figure 1: Effectiveness of SegGen on various data domains: Through training with synthetic data generated by the proposed SegGen, we significantly boost the performance of the state-of-the-art segmentation model Mask2Former [cheng2021mask2former] on evaluation benchmarks including ADE20K [ade20k_sceneparse_150] and COCO [lin2014coco], whilst making it more robust towards challenging images from other domains (the three columns on the left are from PASCAL [everingham2015pascal]; the three on the right are synthesized by the image generation model Kandinsky 2 [kandinsky]). SegGen outperforms the previous best data generation method (DiffuMask [wu2023diffumask]) by a huge margin when models are trained on pure synthetic data. "mIoU*" is the average IoU metric defined by [wu2023diffumask], which focuses on three common classes.
  • Figure 2: Comparison with previous data generation methods for segmentation: (a) Earlier methods [wu2023diffumask, datasetgan] rely on segmentation labeler modules to produce segmentation masks for synthetic images. However, the performance of downstream segmentation models trained on this synthetic data is bounded by the capacity of the segmentation labeler modules, which makes the labeler a major bottleneck for the quality of the generated data. (b) We design a reverse pipeline: we first create diverse new masks from text prompts via a proposed Text2Mask generation model and then synthesize images conditioned on the segmentation masks. This methodology avoids any use of segmentation labeler networks, resulting in significantly improved quality of our synthetic data.
  • Figure 3: Illustration of the workflow of our proposed SegGen: We introduce two generative models: a text-to-mask (Text2Mask) generation model and a mask-to-image (Mask2Img) generation model, based on which we design two approaches for synthesizing segmentation training samples: MaskSyn and ImgSyn. (a) MaskSyn focuses on generating new segmentation masks. It first extracts the caption of the real image as a text prompt and uses it to generate new masks with the Text2Mask model. Then, the new masks and text prompt are fed into the Mask2Img model to produce the corresponding new images. (b) ImgSyn focuses on the synthesis of new images. It directly inputs human-labeled masks and text prompts into the Mask2Img model to generate new images.
  • Figure 4: Generated samples by MaskSyn on ADE20K: The third row overlays the mask and the image together to demonstrate the alignment between them. The generated segmentation masks and images demonstrate high perceptual quality and excellent alignment (see more samples in supplementary materials).
  • Figure 5: Generated samples by ImgSyn on ADE20K: The generated images exhibit remarkable realism and align well with the human-labeled masks and text prompts (see more samples in supplementary materials).
  • ...and 10 more figures
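
The two workflows described in the Figure 3 caption can be sketched as plain data flow. This is a minimal illustration under stated assumptions, not the authors' implementation: `captioner`, `text2mask`, and `mask2img` are hypothetical callables standing in for the captioning step, the Text2Mask model, and the Mask2Img model, and masks and images are represented as nested lists of integers purely so the example runs without any dependencies.

```python
from dataclasses import dataclass
from typing import Callable, List

Mask = List[List[int]]   # placeholder: H x W grid of class ids
Image = List[List[int]]  # placeholder: H x W grid of pixel values


@dataclass
class Sample:
    image: Image
    mask: Mask
    prompt: str


def mask_syn(
    real_image: Image,
    captioner: Callable[[Image], str],       # hypothetical captioning step
    text2mask: Callable[[str], Mask],        # hypothetical Text2Mask model
    mask2img: Callable[[Mask, str], Image],  # hypothetical Mask2Img model
) -> Sample:
    """MaskSyn: caption -> new mask -> new image (both mask and image are new)."""
    prompt = captioner(real_image)
    new_mask = text2mask(prompt)
    new_image = mask2img(new_mask, prompt)
    return Sample(new_image, new_mask, prompt)


def img_syn(
    real_mask: Mask,
    prompt: str,
    mask2img: Callable[[Mask, str], Image],  # hypothetical Mask2Img model
) -> Sample:
    """ImgSyn: human-labeled mask + prompt -> new image (the mask is reused)."""
    new_image = mask2img(real_mask, prompt)
    return Sample(new_image, real_mask, prompt)


# Toy stand-ins so the data flow can be exercised end to end.
def toy_captioner(image: Image) -> str:
    return "a bedroom with a window"


def toy_text2mask(prompt: str) -> Mask:
    return [[0, 1], [1, 0]]


def toy_mask2img(mask: Mask, prompt: str) -> Image:
    return [[class_id * 10 for class_id in row] for row in mask]


synthetic_pair = mask_syn([[5, 5], [5, 5]], toy_captioner, toy_text2mask, toy_mask2img)
synthetic_image = img_syn([[1, 1], [0, 0]], "a bedroom with a window", toy_mask2img)
```

The point of the sketch is the ordering: in both functions the mask exists before the image is generated, so no segmentation labeler is ever run on a synthetic image.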