DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data

Yuanpeng Tu, Xi Chen, Ser-Nam Lim, Hengshuang Zhao

TL;DR

The paper addresses the limited generalization of open-vocabulary panoptic segmentation (OPS) models to novel classes by introducing DreamMask, a data-centric framework that combines LLM-driven vocabulary expansion with context-aware sample synthesis through layout-to-image diffusion. The framework has two stages: Novel Sample Synthesis (NSS) generates synthetic samples with pixel-level annotations, and Imagination Aided Training (IAT) aligns their representations with real data using a synthetic-real alignment loss and online class-wise prototypes. Experiments show consistent gains on open- and closed-vocabulary benchmarks, including a 2.1% mIoU improvement over the previous state of the art on ADE20K when trained on COCO, and the generated data outperforms web-crawled data. DreamMask acts as a plug-and-play enhancement to existing OPS models, which suggests that synthetic, context-aware data may also benefit other open-vocabulary vision tasks.

Abstract

Open-vocabulary panoptic segmentation has received significant attention due to its applicability in the real world. Despite claims of robust generalization, we find that the advancements of previous works are attributable mainly to trained categories, exposing a lack of generalization to novel classes. In this paper, we explore boosting existing models from a data-centric perspective. We propose DreamMask, which systematically explores how to generate training data in the open-vocabulary setting and how to train the model with both real and synthetic data. For the first part, we propose an automatic data generation pipeline built from off-the-shelf models, with crucial designs for vocabulary expansion, layout arrangement, data filtering, etc. Equipped with these techniques, our generated data significantly outperforms manually collected web data. To train the model with generated data, a synthetic-real alignment loss is designed to bridge the representation gap, bringing noticeable improvements across multiple benchmarks. In general, DreamMask significantly simplifies the collection of large-scale training data, serving as a plug-and-play enhancement for existing methods. For instance, when trained on COCO and tested on ADE20K, the model equipped with DreamMask outperforms the previous state of the art by a substantial margin of 2.1% mIoU.
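
The idea of a synthetic-real alignment loss built on online class-wise prototypes can be illustrated with a small sketch. The formulation below is an assumption made for illustration only, not the paper's definition: the `PrototypeBank` class, the exponential-moving-average update, the `momentum` value, and the `1 - cosine similarity` loss are all hypothetical choices. The sketch also only covers classes for which real features exist; how classes without real samples are handled is not specified here.

```python
import math


def l2_normalize(vec):
    """Return vec scaled to unit length (unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class PrototypeBank:
    """Hypothetical store of one running prototype per class.

    Assumption: prototypes are an exponential moving average (EMA) of
    L2-normalized features extracted from real objects.
    """

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.protos = {}

    def update(self, class_name, real_feature):
        """Fold one real-object feature into the class prototype."""
        feat = l2_normalize(real_feature)
        if class_name not in self.protos:
            self.protos[class_name] = feat
            return
        m = self.momentum
        mixed = [m * p + (1.0 - m) * f
                 for p, f in zip(self.protos[class_name], feat)]
        self.protos[class_name] = l2_normalize(mixed)

    def alignment_loss(self, class_name, synthetic_feature):
        """Assumed loss: 1 - cosine similarity to the class prototype.

        Returns 0.0 when no prototype exists yet for the class.
        """
        proto = self.protos.get(class_name)
        if proto is None:
            return 0.0
        feat = l2_normalize(synthetic_feature)
        cosine = sum(p * f for p, f in zip(proto, feat))
        return 1.0 - cosine


bank = PrototypeBank(momentum=0.9)
bank.update("cat", [1.0, 0.0])
loss_aligned = bank.alignment_loss("cat", [2.0, 0.0])     # same direction
loss_orthogonal = bank.alignment_loss("cat", [0.0, 3.0])  # orthogonal
```

In this sketch a synthetic feature pointing in the same direction as its class prototype incurs a loss near zero, while an orthogonal one incurs a loss near one, so minimizing the loss pulls synthetic features toward the real-data prototypes.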

Paper Structure

This paper contains 12 sections, 4 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Comparison of class IoU on novel classes in ADE20K [zhou2017scene] (model trained on COCO [caesar2018coco]) between FC-CLIP [yu2023fcclip] and our FC-CLIP+DreamMask. Our method outperforms the baseline by a large margin on novel categories that have no semantically similar counterparts among the COCO training categories. Detailed results are shown in the supplement.
  • Figure 2: The overall framework of DreamMask, comprising Novel Sample Synthesis (NSS) and Imagination Aided Training (IAT). NSS consists of Category Name Association (CNA) and Context-aware Sample Synthesis (CSS). Specifically, CNA extends the set of novel class names using the association abilities of LLMs, and CSS generates high-quality samples and the corresponding pixel-level annotations with layout-to-image diffusion models and SAM. Finally, in IAT, these samples augment the training set, and a synthetic-real alignment loss mitigates domain shift by pulling the representations of synthetic and real objects closer together.
  • Figure 3: Examples of extended novel categories. Diverse class names can be generated in category name association with the powerful reasoning abilities of LLMs.
  • Figure 4: Context-aware generation. Class names are sampled from $\mathcal{C}_{novel} \cup \mathcal{C}_{train}$ and fed into the LLMs to produce coarse-grained layout descriptions. These descriptions are then fed into the LLMs again for fine-grained visual planning. Finally, a layout-to-image diffusion model [feng2023layoutgpt] is used for high-quality synthesis.
  • Figure 5: Examples of the text descriptions, layouts and corresponding synthetic samples. The LLMs can effectively help generate samples with realistic layouts.
  • ...and 3 more figures
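
To make the metric behind Figure 1 and the reported mIoU gain concrete, the following is a minimal sketch of per-class IoU and its mean over classes, using the standard definition (intersection over union of predicted and ground-truth pixels for each class). The function names, the flat label lists, and the `ignore_label=255` convention are assumptions chosen for illustration and are not taken from the paper's evaluation code.

```python
def per_class_iou(pred, target, num_classes, ignore_label=255):
    """Compute IoU for each class from flat lists of pixel labels.

    Returns a list with one entry per class; the entry is None when the
    class appears in neither the prediction nor the ground truth.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # unlabeled pixels do not count toward any class
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            union[t] += 1  # missed pixel of the true class
            if 0 <= p < num_classes:
                union[p] += 1  # false positive for the predicted class
    return [intersection[c] / union[c] if union[c] else None
            for c in range(num_classes)]


def mean_iou(ious):
    """Average IoU over the classes that actually occur."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid) if valid else 0.0


# Toy example with three classes and six pixels.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
ious = per_class_iou(pred, target, num_classes=3)
miou = mean_iou(ious)
```

Restricting the class list to novel categories only, as in Figure 1, amounts to averaging the same per-class values over that subset.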