Mask Factory: Towards High-quality Synthetic Data Generation for Dichotomous Image Segmentation

Haotian Qian, YD Chen, Shengtao Lou, Fahad Shahbaz Khan, Xiaogang Jin, Deng-Ping Fan

TL;DR

MaskFactory addresses the challenge of producing high-quality, diverse, and precisely labeled DIS data at scale. It introduces a two-stage pipeline that first edits binary masks with rigid and non-rigid transformations guided by geometric priors and topology-preserving adversarial training, then generates aligned RGB images conditioned on masks and Canny edges via a multi-conditional diffusion framework. The approach yields superior structural fidelity and efficiency on the DIS5K benchmark, outperforming diffusion-based baselines and generalizing across multiple segmentation networks. By reducing annotation time and costs while preserving fine-grained DIS details, MaskFactory enables broader deployment of DIS models in real-world applications.

Abstract

Dichotomous Image Segmentation (DIS) tasks require highly precise annotations, and traditional dataset creation methods are labor-intensive and costly and demand extensive domain expertise. Although synthetic data is a promising solution to these challenges, current generative models and techniques struggle with scene deviations, noise-induced errors, and limited variability in training samples. To address these issues, we introduce a novel approach, MaskFactory, which provides a scalable solution for generating diverse and precise datasets, markedly reducing preparation time and costs. We first introduce a general mask editing method that combines rigid and non-rigid editing techniques to generate high-quality synthetic masks. Specifically, rigid editing leverages geometric priors from diffusion models to achieve precise viewpoint transformations under zero-shot conditions, while non-rigid editing employs adversarial training and self-attention mechanisms for complex, topologically consistent modifications. Then, we generate pairs of high-resolution images and accurate segmentation masks using a multi-conditional control generation method. Finally, our experiments on the widely used DIS5K benchmark demonstrate superior performance in quality and efficiency compared to existing methods. The code is available at https://qian-hao-tian.github.io/MaskFactory/.

Paper Structure

This paper contains 36 sections, 14 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: Shows the edited masks from the first stage and the corresponding images generated in the second stage. In the examples, we transformed the viewpoint of park benches and tables from a frontal view to a top-down view and edited their shapes, changing park benches from curved to square edges and tables from square to circular shapes.
  • Figure 2: Workflow of MaskFactory. In the first stage, we generate new masks by applying rigid and non-rigid editing to the existing ground truth masks. In the second stage, we use the generated masks and their corresponding extracted Canny edges as conditions, along with a prompt representing the category, to generate RGB images. This process forms paired data for our generative model.
  • Figure 3: Comparison with baseline methods.
  • Figure 4: UMAP Distribution Differences
  • Figure 5: Visual results of common object mask editing. The model demonstrates strong topological structure preservation and diverse editing outcomes with both rigid and non-rigid edits.
  • ...and 3 more figures
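
The second-stage conditioning described in the Figure 2 caption (a generated mask, its extracted edges, and a category prompt) can be sketched as follows. This is a minimal illustration, not the authors' code: it swaps in a simple 4-neighbour boundary extraction for the Canny detector named in the caption, and the names `mask_boundary`, `build_conditions`, and the prompt template are hypothetical.

```python
from typing import Dict, List, Union

Mask = List[List[int]]  # rows of 0/1 values; 1 marks the foreground object


def mask_boundary(mask: Mask) -> Mask:
    """Mark foreground pixels that touch background in the 4-neighbourhood.

    Stand-in for Canny edge extraction (assumption): on a clean binary
    mask, the object boundary is the only edge content there is.
    """
    h = len(mask)
    w = len(mask[0]) if h else 0
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                # Pixels outside the image are treated as background.
                if not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]:
                    edges[y][x] = 1
                    break
    return edges


def build_conditions(mask: Mask, category: str) -> Dict[str, Union[Mask, str]]:
    """Bundle the three conditions that would be passed to an image generator."""
    return {
        "mask": mask,
        "edges": mask_boundary(mask),
        # Hypothetical prompt template; the source only says the prompt
        # represents the category.
        "prompt": f"a photo of a {category}",
    }


# A 5x5 mask containing a filled 3x3 square.
mask = [[0] * 5] + [[0, 1, 1, 1, 0] for _ in range(3)] + [[0] * 5]
cond = build_conditions(mask, "bench")
```

For the 3x3 square, every foreground pixel except the centre lies on the boundary, so the edge map is a one-pixel ring. A real pipeline would feed these conditions to a multi-conditional diffusion model, which is outside the scope of this sketch.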